[GH-ISSUE #11471] Flash Attention not supported? #33332

Closed
opened 2026-04-22 15:54:19 -05:00 by GiteaMirror · 9 comments

Originally created by @mirage335 on GitHub (Jul 18, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11471

What is the issue?

Laptop RTX 4090 (16 GB). It seems Ollama thinks flash attention is not supported on this GPU.

Yes, as the log shows, I am simultaneously trying to diagnose getting another LLM to use maximum VRAM when an eGPU is plugged in, while still keeping enough of an overhead buffer to at least work when the eGPU is not available.
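
For reference, the flash attention and KV cache settings visible in the server config line of the log below correspond to environment variables; a minimal PowerShell sketch with the values taken from this log (not a recommendation):

```shell
# PowerShell, per-session; values mirror the server config line in the log below
$env:OLLAMA_FLASH_ATTENTION = "true"   # request flash attention
$env:OLLAMA_KV_CACHE_TYPE   = "q8_0"   # quantized KV cache (only takes effect when flash attention is active)
$env:OLLAMA_GPU_OVERHEAD    = "0"      # reserved VRAM headroom, in bytes
$env:OLLAMA_SCHED_SPREAD    = "true"   # spread layers across all visible GPUs (the eGPU case)
ollama serve
```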

Relevant log output

time=2025-07-18T12:33:42.475-04:00 level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\mirag\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:true ROCR_VISIBLE_DEVICES:]"
time=2025-07-18T12:33:42.484-04:00 level=INFO source=images.go:476 msg="total blobs: 94"
time=2025-07-18T12:33:42.487-04:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-07-18T12:33:42.489-04:00 level=INFO source=routes.go:1288 msg="Listening on 127.0.0.1:11434 (version 0.9.7-rc0)"
time=2025-07-18T12:33:42.489-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-07-18T12:33:42.489-04:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-07-18T12:33:42.489-04:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-07-18T12:33:42.489-04:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=8 threads=20
time=2025-07-18T12:33:42.607-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-9e774f15-e659-849b-aa06-4415cca19573 library=cuda variant=v12 compute=8.9 driver=12.9 name="NVIDIA GeForce RTX 4090 Laptop GPU" total="16.0 GiB" available="14.7 GiB"
time=2025-07-18T12:34:01.865-04:00 level=INFO source=server.go:135 msg="system memory" total="63.7 GiB" free="44.2 GiB" free_swap="110.9 GiB"
time=2025-07-18T12:34:01.885-04:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=999 layers.model=81 layers.offload=0 layers.split="" memory.available="[14.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="14.6 GiB" memory.required.partial="0 B" memory.required.kv="2.2 GiB" memory.required.allocations="[0 B]" memory.weights.total="12.4 GiB" memory.weights.repeating="11.7 GiB" memory.weights.nonrepeating="688.9 MiB" memory.graph.full="23.3 GiB" memory.graph.partial="23.3 GiB"
time=2025-07-18T12:34:01.885-04:00 level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-07-18T12:34:01.885-04:00 level=WARN source=server.go:229 msg="quantized kv cache requested but flash attention disabled" type=q8_0
llama_model_loader: loaded meta data with 38 key-value pairs and 569 tensors from C:\Users\mirag\.ollama\models\blobs\sha256-e8d0c0186ba2e3deb914d17a23caf8e21a5ec885ccf4ccb54101ccbcd95c8a36 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deci
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama_Nemotron_Super
llama_model_loader: - kv   3:                            general.version str              = v1
llama_model_loader: - kv   4:                           general.finetune str              = 3_3-Nemotron-Super
llama_model_loader: - kv   5:                           general.basename str              = Llama
llama_model_loader: - kv   6:                         general.size_label str              = 49B
llama_model_loader: - kv   7:                            general.license str              = other
llama_model_loader: - kv   8:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv   9:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv  10:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv  11:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  12:                        deci.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  13:               deci.attention.head_count_kv arr[i32,80]      = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, ...
llama_model_loader: - kv  14:                  deci.attention.head_count arr[i32,80]      = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64...
llama_model_loader: - kv  15:                   deci.feed_forward_length arr[i32,80]      = [14336, 28672, 28672, 28672, 28672, 2...
llama_model_loader: - kv  16:                           deci.block_count u32              = 80
llama_model_loader: - kv  17:                        deci.context_length u32              = 131072
llama_model_loader: - kv  18:                      deci.embedding_length u32              = 8192
llama_model_loader: - kv  19:      deci.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  20:                  deci.attention.key_length u32              = 128
llama_model_loader: - kv  21:                deci.attention.value_length u32              = 128
llama_model_loader: - kv  22:                            deci.vocab_size u32              = 128256
llama_model_loader: - kv  23:                  deci.rope.dimension_count u32              = 128
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{- bos_token }}{%- if messages[0]['r...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 19
llama_model_loader: - kv  34:                      quantize.imatrix.file str              = /models_out/Llama-3_3-Nemotron-Super-...
llama_model_loader: - kv  35:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  36:             quantize.imatrix.entries_count i32              = 436
llama_model_loader: - kv  37:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:  131 tensors
llama_model_loader: - type q2_K:   11 tensors
llama_model_loader: - type q4_K:   49 tensors
llama_model_loader: - type q5_K:    1 tensors
llama_model_loader: - type iq2_xxs:  377 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ2_XXS - 2.0625 bpw
print_info: file size   = 12.71 GiB (2.19 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = deci
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 49.87 B
print_info: general.name     = Llama_Nemotron_Super
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-07-18T12:34:02.883-04:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\mirag\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\mirag\\.ollama\\models\\blobs\\sha256-e8d0c0186ba2e3deb914d17a23caf8e21a5ec885ccf4ccb54101ccbcd95c8a36 --ctx-size 14336 --batch-size 512 --n-gpu-layers 999 --threads 14 --no-mmap --parallel 4 --port 57803"
time=2025-07-18T12:34:02.898-04:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-07-18T12:34:02.898-04:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-18T12:34:02.900-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
time=2025-07-18T12:34:02.932-04:00 level=INFO source=runner.go:815 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 Laptop GPU, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\mirag\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\mirag\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-07-18T12:34:03.017-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-07-18T12:34:03.018-04:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:57803"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090 Laptop GPU) - 15048 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 569 tensors from C:\Users\mirag\.ollama\models\blobs\sha256-e8d0c0186ba2e3deb914d17a23caf8e21a5ec885ccf4ccb54101ccbcd95c8a36 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deci
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama_Nemotron_Super
llama_model_loader: - kv   3:                            general.version str              = v1
llama_model_loader: - kv   4:                           general.finetune str              = 3_3-Nemotron-Super
llama_model_loader: - kv   5:                           general.basename str              = Llama
llama_model_loader: - kv   6:                         general.size_label str              = 49B
llama_model_loader: - kv   7:                            general.license str              = other
llama_model_loader: - kv   8:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv   9:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv  10:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv  11:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  12:                        deci.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  13:               deci.attention.head_count_kv arr[i32,80]      = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, ...
llama_model_loader: - kv  14:                  deci.attention.head_count arr[i32,80]      = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64...
llama_model_loader: - kv  15:                   deci.feed_forward_length arr[i32,80]      = [14336, 28672, 28672, 28672, 28672, 2...
llama_model_loader: - kv  16:                           deci.block_count u32              = 80
llama_model_loader: - kv  17:                        deci.context_length u32              = 131072
llama_model_loader: - kv  18:                      deci.embedding_length u32              = 8192
llama_model_loader: - kv  19:      deci.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  20:                  deci.attention.key_length u32              = 128
llama_model_loader: - kv  21:                deci.attention.value_length u32              = 128
llama_model_loader: - kv  22:                            deci.vocab_size u32              = 128256
llama_model_loader: - kv  23:                  deci.rope.dimension_count u32              = 128
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = llama-bpe
time=2025-07-18T12:34:03.152-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{- bos_token }}{%- if messages[0]['r...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 19
llama_model_loader: - kv  34:                      quantize.imatrix.file str              = /models_out/Llama-3_3-Nemotron-Super-...
llama_model_loader: - kv  35:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  36:             quantize.imatrix.entries_count i32              = 436
llama_model_loader: - kv  37:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:  131 tensors
llama_model_loader: - type q2_K:   11 tensors
llama_model_loader: - type q4_K:   49 tensors
llama_model_loader: - type q5_K:    1 tensors
llama_model_loader: - type iq2_xxs:  377 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ2_XXS - 2.0625 bpw
print_info: file size   = 12.71 GiB (2.19 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = deci
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 8192
print_info: n_layer          = 80
print_info: n_head           = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64, 64, 0, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 64, 64, 64, 64, 64, 64, 64, 64]
print_info: n_head_kv        = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8]
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8]
print_info: n_embd_k_gqa     = [1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
print_info: n_embd_v_gqa     = [1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = [14336, 28672, 28672, 28672, 28672, 28672, 14336, 14336, 28672, 28672, 28672, 17920, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 7168, 14336, 14336, 7168, 28672, 7168, 14336, 7168, 7168, 7168, 28672, 7168, 5632, 5632, 7168, 5632, 5632, 5632, 7168, 7168, 2816, 2816, 5632, 5632, 2816, 2816, 5632, 2816, 2816, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672]
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 70B
print_info: model params     = 49.87 B
print_info: general.name     = Llama_Nemotron_Super
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors:        CUDA0 model buffer size = 12690.13 MiB
load_tensors:          CPU model buffer size =   328.78 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 14336
llama_context: n_ctx_per_seq = 3584
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (3584) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.08 MiB
llama_kv_cache_unified: kv_size = 14336, type_k = 'f16', type_v = 'f16', n_layer = 80, can_shift = 1, padding = 32
CUDA error: out of memory
  current device: 0, in function ggml_backend_cuda_buffer_clear at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:622
  cudaDeviceSynchronize()
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:76: CUDA error
time=2025-07-18T12:34:07.112-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-18T12:34:07.225-04:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 0xc0000409"
time=2025-07-18T12:34:07.363-04:00 level=ERROR source=sched.go:487 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
[GIN] 2025/07/18 - 12:34:07 | 500 |    5.9167696s |       127.0.0.1 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.9.7-rc0

GiteaMirror added the bug label 2026-04-22 15:54:19 -05:00

@mirage335 commented on GitHub (Jul 18, 2025):

From the logs of other experiments, it looks like this may be specific to Llama-3_3-Nemotron-Super-49B-v1 and other models of that series.
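
A quick way to check which architecture a local model reports is `ollama show`; the model tag below is illustrative, and models in this family report `deci`:

```shell
# Print model metadata; the architecture field for this family shows "deci" (model tag illustrative)
ollama show nemotron-super:49b
```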

@rick-github commented on GitHub (Jul 19, 2025):

time=2025-07-18T12:34:01.885-04:00 level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"

Usually means that the drivers are too old.
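
One quick way to confirm the installed driver and compute capability on the affected machine:

```shell
# Report GPU name, driver version, and compute capability
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv
```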

@aaronnewsome commented on GitHub (Jul 29, 2025):

Using the latest rc2 docker container, I had the same issue running Llama-3_3-Nemotron-Super-49B-v1: Ollama said that flash attention was not supported, even though it should be. I couldn't figure out what the issue was, so I ended up using llama.cpp to run this specific model, but Ollama runs all the rest.

Now I see rc3 is out, so maybe there's a fix in there. I'm downloading it now ...
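
For anyone testing this through Docker, pinning the release-candidate tag keeps the comparison reproducible; a sketch, with an illustrative tag (check the published tags for the actual name):

```shell
# Pull and run a pinned release-candidate image (tag illustrative)
docker pull ollama/ollama:0.10.0-rc3
docker run -d --gpus=all -p 11434:11434 -v ollama:/root/.ollama \
  -e OLLAMA_FLASH_ATTENTION=true --name ollama ollama/ollama:0.10.0-rc3
```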

@dhiltgen commented on GitHub (Jul 31, 2025):

@mirage335 can you share a server log with `$env:OLLAMA_DEBUG="1"` set so we can see some more details around GPU discovery and scheduling? This might be related to #11614
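
A minimal way to capture that from a PowerShell session, assuming the background service/tray app is stopped so `ollama serve` runs in the foreground:

```shell
# PowerShell: enable debug logging and tee the server output to a file
$env:OLLAMA_DEBUG = "1"
ollama serve 2>&1 | Tee-Object -FilePath ollama-debug.log
```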

@chhu commented on GitHub (Aug 1, 2025):

Same problem on our Linux RTX 3090 setup; llama.cpp compiled from source works. BTW, the new Qwen3 30B runs much faster with llama.cpp: I get 90 t/s eval with llama.cpp (no FA) vs ~60 t/s eval with Ollama 0.10.1.

nvidia-smi
Fri Aug  1 13:18:28 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
| 30%   35C    P8             15W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:25:00.0 Off |                  N/A |
| 30%   35C    P8             16W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off |   00000000:41:00.0 Off |                  N/A |
| 30%   32C    P8             26W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off |   00000000:61:00.0 Off |                  N/A |
| 30%   30C    P8             23W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 3090        Off |   00000000:81:00.0 Off |                  N/A |
| 30%   34C    P8             23W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 3090        Off |   00000000:A1:00.0 Off |                  N/A |
| 30%   32C    P8             17W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 3090        Off |   00000000:C1:00.0 Off |                  N/A |
| 30%   32C    P8             21W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 3090        Off |   00000000:E1:00.0 Off |                  N/A |
| 30%   30C    P8             21W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

cut from ollama serve w/ debug and FA on:
initializing /lib64/libcuda.so.570.124.06
dlsym: cuInit - 0x7faf97e4ae70
dlsym: cuDriverGetVersion - 0x7faf97e4ae90
dlsym: cuDeviceGetCount - 0x7faf97e4aed0
dlsym: cuDeviceGet - 0x7faf97e4aeb0
dlsym: cuDeviceGetAttribute - 0x7faf97e4afb0
dlsym: cuDeviceGetUuid - 0x7faf97e4af10
dlsym: cuDeviceGetName - 0x7faf97e4aef0
dlsym: cuCtxCreate_v3 - 0x7faf97e4b190
dlsym: cuMemGetInfo_v2 - 0x7faf97e4b910
dlsym: cuCtxDestroy - 0x7faf97ea9ab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 8
time=2025-08-01T13:20:05.372+02:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=8 library=/lib64/libcuda.so.570.124.06
[GPU-1068f375-e3fe-422f-f71b-73a2394e701f] CUDA totalMem 24135mb
[GPU-1068f375-e3fe-422f-f71b-73a2394e701f] CUDA freeMem 23871mb
[GPU-1068f375-e3fe-422f-f71b-73a2394e701f] Compute Capability 8.6
[GPU-615262d3-7dfe-666a-cf37-566e760e4ed1] CUDA totalMem 24135mb
[GPU-615262d3-7dfe-666a-cf37-566e760e4ed1] CUDA freeMem 23871mb
[GPU-615262d3-7dfe-666a-cf37-566e760e4ed1] Compute Capability 8.6
[GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c] CUDA totalMem 24135mb
[GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c] CUDA freeMem 23871mb
[GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c] Compute Capability 8.6
[GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036] CUDA totalMem 24135mb
[GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036] CUDA freeMem 23871mb
[GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036] Compute Capability 8.6
[GPU-02de372e-3119-89d2-b10e-16ce6064e6e0] CUDA totalMem 24135mb
[GPU-02de372e-3119-89d2-b10e-16ce6064e6e0] CUDA freeMem 23871mb
[GPU-02de372e-3119-89d2-b10e-16ce6064e6e0] Compute Capability 8.6
[GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5] CUDA totalMem 24135mb
[GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5] CUDA freeMem 23871mb
[GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5] Compute Capability 8.6
[GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3] CUDA totalMem 24135mb
[GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3] CUDA freeMem 23871mb
[GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3] Compute Capability 8.6
[GPU-b355efbb-1628-3337-ee30-11313755b901] CUDA totalMem 24135mb
[GPU-b355efbb-1628-3337-ee30-11313755b901] CUDA freeMem 23871mb
[GPU-b355efbb-1628-3337-ee30-11313755b901] Compute Capability 8.6
time=2025-08-01T13:20:07.273+02:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-1068f375-e3fe-422f-f71b-73a2394e701f library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-615262d3-7dfe-666a-cf37-566e760e4ed1 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-02de372e-3119-89d2-b10e-16ce6064e6e0 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-b355efbb-1628-3337-ee30-11313755b901 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"

time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=memory.go:201 msg="gpu has too little memory to allocate any layers" id=GPU-615262d3-7dfe-666a-cf37-566e760e4ed1 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.1 GiB" gpu_zer_overhead="0 B" partial_offload="31.2 GiB" full_offload="31.2 GiB"
time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=memory.go:351 msg="insufficient VRAM to load any model layers"
time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3moe.vision.block_count default=0
time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.2 GiB" before.free="481.0 GiB" before.free_swap="0 B" now.total="503.2 GiB" now.free="481.0 GiB" now.free_swap="0 B"
initializing /lib64/libcuda.so.570.124.06
dlsym: cuInit - 0x7faf97e4ae70
dlsym: cuDriverGetVersion - 0x7faf97e4ae90
dlsym: cuDeviceGetCount - 0x7faf97e4aed0
dlsym: cuDeviceGet - 0x7faf97e4aeb0
dlsym: cuDeviceGetAttribute - 0x7faf97e4afb0
dlsym: cuDeviceGetUuid - 0x7faf97e4af10
dlsym: cuDeviceGetName - 0x7faf97e4aef0
dlsym: cuCtxCreate_v3 - 0x7faf97e4b190
dlsym: cuMemGetInfo_v2 - 0x7faf97e4b910
dlsym: cuCtxDestroy - 0x7faf97ea9ab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 8
time=2025-08-01T13:21:41.356+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-1068f375-e3fe-422f-f71b-73a2394e701f name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:41.623+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-615262d3-7dfe-666a-cf37-566e760e4ed1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:41.858+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:42.093+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:42.329+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-02de372e-3119-89d2-b10e-16ce6064e6e0 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:42.570+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:42.812+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:43.049+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-b355efbb-1628-3337-ee30-11313755b901 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
releasing cuda driver library
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=memory.go:201 msg="gpu has too little memory to allocate any layers" id=GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.1 GiB" gpu_zer_overhead="0 B" partial_offload="31.2 GiB" full_offload="31.2 GiB"
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=memory.go:351 msg="insufficient VRAM to load any model layers"
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3moe.vision.block_count default=0
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.2 GiB" before.free="481.0 GiB" before.free_swap="0 B" now.total="503.2 GiB" now.free="481.0 GiB" now.free_swap="0 B"
initializing /lib64/libcuda.so.570.124.06
dlsym: cuInit - 0x7faf97e4ae70
dlsym: cuDriverGetVersion - 0x7faf97e4ae90
dlsym: cuDeviceGetCount - 0x7faf97e4aed0
dlsym: cuDeviceGet - 0x7faf97e4aeb0
dlsym: cuDeviceGetAttribute - 0x7faf97e4afb0
dlsym: cuDeviceGetUuid - 0x7faf97e4af10
dlsym: cuDeviceGetName - 0x7faf97e4aef0
dlsym: cuCtxCreate_v3 - 0x7faf97e4b190
dlsym: cuMemGetInfo_v2 - 0x7faf97e4b910
dlsym: cuCtxDestroy - 0x7faf97ea9ab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 8
... repeating...

time=2025-08-01T13:21:58.234+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=0 layers.split="" memory.available="[23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="53.4 GiB" memory.required.partial="0 B" memory.required.kv="23.4 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B 0 B 0 B 0 B 0 B]" memory.weights.total="29.9 GiB" memory.weights.repeating="29.6 GiB" memory.weights.nonrepeating="315.3 MiB" memory.graph.full="31.2 GiB" memory.graph.partial="31.2 GiB"
time=2025-08-01T13:21:58.234+02:00 level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-08-01T13:21:58.234+02:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]

... then everything ends up in CPU memory. llama.cpp also does not manage to fit it all into the GPUs with FA=off, but once FA=on there seems to be plenty of room. The new Qwen3 30B w/ 262k context (almost maxed out) gives me ~7 t/s compared to the 90 t/s with an almost empty context. ctk and ctv were not touched (f16 default).
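
For a sense of scale, here is a minimal sketch of the f16 KV-cache arithmetic for this model, using the qwen3moe figures from the print_info dumps further down in the thread (n_layer = 48, n_embd_k_gqa = n_embd_v_gqa = 512); the helper name is only for illustration:

```python
# Rough f16 KV-cache size for Qwen3 30B A3B, using values from the
# print_info dumps below: n_layer=48, n_embd_k_gqa=512, n_embd_v_gqa=512,
# f16 cache = 2 bytes per element.
def kv_cache_mib(n_ctx, n_layer=48, kv_dims=512 + 512, bytes_per_elem=2):
    """MiB = n_ctx * n_layer * (K dims + V dims) * bytes per element / 2**20."""
    return n_ctx * n_layer * kv_dims * bytes_per_elem / 2**20

print(kv_cache_mib(160_000))   # 15000.0 MiB -- matches the "KV self size" in the runner log below
print(kv_cache_mib(262_144))   # 24576.0 MiB -- matches the llama.cpp run below; ~24 GiB for the cache alone
```

So at the full 262k window the f16 KV cache by itself already exceeds a single 24 GB card, before any weights or compute buffers are counted.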


@chhu commented on GitHub (Aug 4, 2025):

Wow, I just realized this is more serious than expected. I always thought of FA as a speed-up-only feature, but it turns out it also saves a lot of precious VRAM (or allows much bigger ctx windows). I will try to compile from source and see if that solves this, please fix!!
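
To put a rough number on the VRAM side of that: as I understand it, ollama only applies a quantized KV cache (OLLAMA_KV_CACHE_TYPE, e.g. q8_0) when flash attention is active, so FA indirectly unlocks a further cache reduction. A hedged sketch, assuming q8_0 costs roughly 8.5 bits per element (the same ~8.5 BPW that shows up for the q8_0 weights in the dumps below):

```python
# Rough f16 vs q8_0 KV-cache size at full context for this model
# (n_layer=48, K+V dims per layer = 1024).  q8_0 is assumed to cost
# ~8.5 bits per element (blocks of 32 int8 values plus one f16 scale).
def kv_cache_gib(n_ctx, bits_per_elem, n_layer=48, kv_dims=1024):
    return n_ctx * n_layer * kv_dims * bits_per_elem / 8 / 2**30

n_ctx = 262_144
print(f"f16 : {kv_cache_gib(n_ctx, 16):.1f} GiB")   # ~24.0 GiB
print(f"q8_0: {kv_cache_gib(n_ctx, 8.5):.1f} GiB")  # ~12.8 GiB
```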


@chhu commented on GitHub (Aug 4, 2025):

Got it working by compiling manually, but ollama is still offloading to CPU. llama.cpp manages qwen3_30BA3B-Q8 with the full ctx window of 262,144 on 3 24GB GPUs. With ollama I use all 8 GPUs and still get heavy CPU offload despite having FA:

initializing /usr/lib64/libcuda.so.570.124.06
dlsym: cuInit - 0x7f2f9be4ae70
dlsym: cuDriverGetVersion - 0x7f2f9be4ae90
dlsym: cuDeviceGetCount - 0x7f2f9be4aed0
dlsym: cuDeviceGet - 0x7f2f9be4aeb0
dlsym: cuDeviceGetAttribute - 0x7f2f9be4afb0
dlsym: cuDeviceGetUuid - 0x7f2f9be4af10
dlsym: cuDeviceGetName - 0x7f2f9be4aef0
dlsym: cuCtxCreate_v3 - 0x7f2f9be4b190
dlsym: cuMemGetInfo_v2 - 0x7f2f9be4b910
dlsym: cuCtxDestroy - 0x7f2f9bea9ab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 8
time=2025-08-04T16:45:48.925+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-1068f375-e3fe-422f-f71b-73a2394e701f name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:49.149+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-615262d3-7dfe-666a-cf37-566e760e4ed1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:49.377+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:49.610+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:49.842+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-02de372e-3119-89d2-b10e-16ce6064e6e0 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:50.071+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:50.294+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:50.522+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-b355efbb-1628-3337-ee30-11313755b901 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
releasing cuda driver library
time=2025-08-04T16:45:50.523+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=16 layers.split=2,2,2,2,2,2,2,2 memory.available="[23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="211.8 GiB" memory.required.partial="182.0 GiB" memory.required.kv="14.6 GiB" memory.required.allocations="[22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB]" memory.weights.total="29.9 GiB" memory.weights.repeating="29.6 GiB" memory.weights.nonrepeating="315.3 MiB" memory.graph.full="19.5 GiB" memory.graph.partial="19.5 GiB"
time=2025-08-04T16:45:50.523+02:00 level=INFO source=server.go:218 msg="enabling flash attention"
time=2025-08-04T16:45:50.523+02:00 level=WARN source=server.go:226 msg="kv cache type not supported by model" type=""
time=2025-08-04T16:45:50.523+02:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
llama_model_loader: loaded meta data with 33 key-value pairs and 579 tensors from /scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                           general.basename str              = Qwen3
llama_model_loader: - kv   2:                          general.file_type u32              = 7
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                               general.name str              = Qwen3 30B A3B Instruct 2507
llama_model_loader: - kv   5:                    general.parameter_count u64              = 30532122624
llama_model_loader: - kv   6:               general.quantization_version u32              = 2
llama_model_loader: - kv   7:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   8:                               general.type str              = model
llama_model_loader: - kv   9:                            general.version str              = 2507
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  18:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  19:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  20:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  21:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  22:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q8_0:  338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.25 GiB (8.51 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B Instruct 2507
print_info: n_ff_exp         = 0
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-08-04T16:45:50.760+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/scr2/new/ollama/bin/ollama runner --model /scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 --ctx-size 160000 --batch-size 512 --n-gpu-layers 16 --threads 64 --flash-attn --parallel 1 --tensor-split 2,2,2,2,2,2,2,2 --port 46783"
time=2025-08-04T16:45:50.760+02:00 level=DEBUG source=server.go:439 msg=subprocess LD_LIBRARY_PATH=/scr2/new/ollama/lib/ollama:/vicsys/utils_rh8/cuda/lib64:/plp_scr1/utils/mpi/lib:/plp_scr1/utils/mpi_5/ucx_build/lib:/vicsys/utils_rh8/gcc/lib64:/vicsys/utils_rh8/gcc-11.5.0/lib64:/vicsys/utils_rh8/gcc-11.5.0/lib64:/vicsys/utils_rh8/gcc-11.5.0/lib64:/vicsys/utils_rh8/cuda/lib64:/plp_scr1/utils/mpi/lib:/plp_scr1/utils/mpi_5/ucx_build/lib:/vicsys/utils_rh8/gcc/lib64:/plp_scr1/utils/mpi/lib:/plp_scr1/utils/mpi_5/ucx_build/lib:/vicsys/utils_rh8/gcc/lib64::/vicsys/utils_rh8/cuda/lib64:/scr2/new/ollama/lib/ollama CUDA_BASE=/vicsys/utils_rh8/cuda CUDA_PATH=/vicsys/utils_rh8/cuda OLLAMA_HOST=0.0.0.0 OLLAMA_NUM_PARALLEL=1 OLLAMA_KEEP_ALIVE=-1 OLLAMA_FLASH_ATTENTION=1 CUDA_HOME=/vicsys/utils_rh8/cuda OLLAMA_ORIGINS=* OLLAMA_DEBUG=1 PATH=/vicsys/utils_rh8/paraview/bin:/vicsys/utils_rh8/cmake/bin:/plp_scr1/utils/mpi/bin:/plp_scr1/utils/mpi_5/ucx_build/bin:/vicsys/utils_rh8/gcc/bin:/home/huettig/bin:/home/huettig/bin_extra:/home/huettig/.local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/vicsys/utils_rh8/cuda/bin:/plp_scr1/utils/numa/bin:/vicsys/utils_rh8/node/bin OLLAMA_MODELS=/scr2/new/ollama/models/models OLLAMA_MAX_LOADED_MODELS=24 OLLAMA_LIBRARY_PATH=/scr2/new/ollama/lib/ollama CUDA_VISIBLE_DEVICES=GPU-1068f375-e3fe-422f-f71b-73a2394e701f,GPU-615262d3-7dfe-666a-cf37-566e760e4ed1,GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c,GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036,GPU-02de372e-3119-89d2-b10e-16ce6064e6e0,GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5,GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3,GPU-b355efbb-1628-3337-ee30-11313755b901
time=2025-08-04T16:45:50.761+02:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-04T16:45:50.761+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-04T16:45:50.761+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-04T16:45:50.786+02:00 level=INFO source=runner.go:815 msg="starting go runner"
time=2025-08-04T16:45:50.786+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/scr2/new/ollama/lib/ollama
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 6: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 7: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /scr2/new/ollama/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /scr2/new/ollama/lib/ollama/libggml-cpu-haswell.so
time=2025-08-04T16:45:52.012+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=860 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=860 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=860 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=860 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=860 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=860 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 CUDA.6.ARCHS=860 CUDA.6.USE_GRAPHS=1 CUDA.6.PEER_MAX_BATCH_SIZE=128 CUDA.7.ARCHS=860 CUDA.7.USE_GRAPHS=1 CUDA.7.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-04T16:45:52.013+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:46783"
time=2025-08-04T16:45:52.016+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA4 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA5 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA6 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA7 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 579 tensors from /scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                           general.basename str              = Qwen3
llama_model_loader: - kv   2:                          general.file_type u32              = 7
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                               general.name str              = Qwen3 30B A3B Instruct 2507
llama_model_loader: - kv   5:                    general.parameter_count u64              = 30532122624
llama_model_loader: - kv   6:               general.quantization_version u32              = 2
llama_model_loader: - kv   7:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   8:                               general.type str              = model
llama_model_loader: - kv   9:                            general.version str              = 2507
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  18:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  19:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  20:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  21:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  22:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q8_0:  338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.25 GiB (8.51 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 30B.A3B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B Instruct 2507
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: layer  27 assigned to device CPU, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
load_tensors: layer  29 assigned to device CPU, is_swa = 0
load_tensors: layer  30 assigned to device CPU, is_swa = 0
load_tensors: layer  31 assigned to device CPU, is_swa = 0
load_tensors: layer  32 assigned to device CUDA0, is_swa = 0
load_tensors: layer  33 assigned to device CUDA0, is_swa = 0
load_tensors: layer  34 assigned to device CUDA1, is_swa = 0
load_tensors: layer  35 assigned to device CUDA1, is_swa = 0
load_tensors: layer  36 assigned to device CUDA2, is_swa = 0
load_tensors: layer  37 assigned to device CUDA2, is_swa = 0
load_tensors: layer  38 assigned to device CUDA3, is_swa = 0
load_tensors: layer  39 assigned to device CUDA3, is_swa = 0
load_tensors: layer  40 assigned to device CUDA4, is_swa = 0
load_tensors: layer  41 assigned to device CUDA4, is_swa = 0
load_tensors: layer  42 assigned to device CUDA5, is_swa = 0
load_tensors: layer  43 assigned to device CUDA5, is_swa = 0
load_tensors: layer  44 assigned to device CUDA6, is_swa = 0
load_tensors: layer  45 assigned to device CUDA6, is_swa = 0
load_tensors: layer  46 assigned to device CUDA7, is_swa = 0
load_tensors: layer  47 assigned to device CUDA7, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q8_0) (and 386 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 16 repeating layers to GPU
load_tensors: offloaded 16/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  1264.28 MiB
load_tensors:        CUDA1 model buffer size =  1264.28 MiB
load_tensors:        CUDA2 model buffer size =  1264.28 MiB
load_tensors:        CUDA3 model buffer size =  1264.28 MiB
load_tensors:        CUDA4 model buffer size =  1264.28 MiB
load_tensors:        CUDA5 model buffer size =  1264.28 MiB
load_tensors:        CUDA6 model buffer size =  1264.28 MiB
load_tensors:        CUDA7 model buffer size =  1264.28 MiB
load_tensors:   CPU_Mapped model buffer size = 30973.40 MiB
time=2025-08-04T16:45:55.780+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.08"
time=2025-08-04T16:45:56.031+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.15"
time=2025-08-04T16:45:56.282+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.23"
time=2025-08-04T16:45:56.533+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.31"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 160000
llama_context: n_ctx_per_seq = 160000
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (160000) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.59 MiB
create_memory: n_ctx = 160000 (padded)
llama_kv_cache_unified: kv_size = 160000, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 256
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: dev = CPU
llama_kv_cache_unified: layer  23: dev = CPU
llama_kv_cache_unified: layer  24: dev = CPU
llama_kv_cache_unified: layer  25: dev = CPU
llama_kv_cache_unified: layer  26: dev = CPU
llama_kv_cache_unified: layer  27: dev = CPU
llama_kv_cache_unified: layer  28: dev = CPU
llama_kv_cache_unified: layer  29: dev = CPU
llama_kv_cache_unified: layer  30: dev = CPU
llama_kv_cache_unified: layer  31: dev = CPU
llama_kv_cache_unified: layer  32: dev = CUDA0
llama_kv_cache_unified: layer  33: dev = CUDA0
llama_kv_cache_unified: layer  34: dev = CUDA1
llama_kv_cache_unified: layer  35: dev = CUDA1
llama_kv_cache_unified: layer  36: dev = CUDA2
llama_kv_cache_unified: layer  37: dev = CUDA2
llama_kv_cache_unified: layer  38: dev = CUDA3
llama_kv_cache_unified: layer  39: dev = CUDA3
llama_kv_cache_unified: layer  40: dev = CUDA4
llama_kv_cache_unified: layer  41: dev = CUDA4
llama_kv_cache_unified: layer  42: dev = CUDA5
llama_kv_cache_unified: layer  43: dev = CUDA5
llama_kv_cache_unified: layer  44: dev = CUDA6
llama_kv_cache_unified: layer  45: dev = CUDA6
llama_kv_cache_unified: layer  46: dev = CUDA7
llama_kv_cache_unified: layer  47: dev = CUDA7
llama_kv_cache_unified:      CUDA0 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA2 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA3 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA4 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA5 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA6 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA7 KV buffer size =   625.00 MiB
time=2025-08-04T16:45:56.784+02:00 level=DEBUG source=server.go:643 msg="model load progress 1.00"
time=2025-08-04T16:45:57.035+02:00 level=DEBUG source=server.go:646 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_unified:        CPU KV buffer size = 10000.00 MiB
llama_kv_cache_unified: KV self size  = 15000.00 MiB, K (f16): 7500.00 MiB, V (f16): 7500.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 9
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =   797.25 MiB
llama_context:      CUDA1 compute buffer size =   208.75 MiB
llama_context:      CUDA2 compute buffer size =   208.75 MiB
llama_context:      CUDA3 compute buffer size =   208.75 MiB
llama_context:      CUDA4 compute buffer size =   208.75 MiB
llama_context:      CUDA5 compute buffer size =   208.75 MiB
llama_context:      CUDA6 compute buffer size =   208.75 MiB
llama_context:      CUDA7 compute buffer size =   208.75 MiB
llama_context:  CUDA_Host compute buffer size =   316.51 MiB
llama_context: graph nodes  = 2935
llama_context: graph splits = 459 (with bs=512), 74 (with bs=1)
time=2025-08-04T16:46:00.798+02:00 level=INFO source=server.go:637 msg="llama runner started in 10.04 seconds"
time=2025-08-04T16:46:00.798+02:00 level=DEBUG source=sched.go:493 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:30b-a3b-instruct-2507-q8_0 runner.inference=cuda runner.devices=8 runner.size="211.8 GiB" runner.vram="182.0 GiB" runner.parallel=1 runner.pid=3393640 runner.model=/scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 runner.num_ctx=160000
time=2025-08-04T16:46:00.806+02:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=264 format=""
time=2025-08-04T16:46:00.813+02:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=59 used=0 remaining=59
time=2025-08-04T16:46:12.671+02:00 level=DEBUG source=sched.go:501 msg="context for request finished"
time=2025-08-04T16:46:12.672+02:00 level=DEBUG source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3:30b-a3b-instruct-2507-q8_0 runner.inference=cuda runner.devices=8 runner.size="211.8 GiB" runner.vram="182.0 GiB" runner.parallel=1 runner.pid=3393640 runner.model=/scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 runner.num_ctx=160000 duration=2562047h47m16.854775807s
time=2025-08-04T16:46:12.672+02:00 level=DEBUG source=sched.go:359 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:30b-a3b-instruct-2507-q8_0 runner.inference=cuda runner.devices=8 runner.size="211.8 GiB" runner.vram="182.0 GiB" runner.parallel=1 runner.pid=3393640 runner.model=/scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 runner.num_ctx=160000 refCount=0
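
The per-device numbers in the runner log above are consistent with the 16-layer offload: 2 KV layers per GPU and 32 layers left on the CPU. A quick sanity check with the same formula as before (names are illustrative):

```python
# KV buffer per device for the run above: n_ctx=160000, f16 cache,
# 1024 K+V dims per layer, 2 layers per GPU, 32 layers on the CPU.
def kv_buffer_mib(n_layers, n_ctx=160_000, kv_dims=1024, bytes_per_elem=2):
    return n_ctx * n_layers * kv_dims * bytes_per_elem / 2**20

print(kv_buffer_mib(2))    #   625.0 MiB per GPU -- matches the CUDA0..CUDA7 KV buffer lines above
print(kv_buffer_mib(32))   # 10000.0 MiB on CPU  -- matches the CPU KV buffer size above
```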

Here is the llama.cpp output, restricted to 3 GPUs:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 6056 (baad9488) with cc (GCC) 15.1.0 for x86_64-pc-linux-gnu
system info: n_threads = 64, n_threads_batch = 64, total_threads = 128

system_info: n_threads = 64 (n_threads_batch = 64) / 128 | CUDA : ARCHS = 860 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 127
main: loading model
srv    load_model: loading model 'models/qwen3-30.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 579 tensors from models/qwen3-30.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                           general.basename str              = Qwen3
llama_model_loader: - kv   2:                          general.file_type u32              = 7
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                               general.name str              = Qwen3 30B A3B Instruct 2507
llama_model_loader: - kv   5:                    general.parameter_count u64              = 30532122624
llama_model_loader: - kv   6:               general.quantization_version u32              = 2
llama_model_loader: - kv   7:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   8:                               general.type str              = model
llama_model_loader: - kv   9:                            general.version str              = 2507
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  18:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  19:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  20:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  21:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  22:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q8_0:  338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.25 GiB (8.51 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: model type       = 30B.A3B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B Instruct 2507
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   315.30 MiB
load_tensors:        CUDA0 model buffer size = 10746.41 MiB
load_tensors:        CUDA1 model buffer size = 10114.27 MiB
load_tensors:        CUDA2 model buffer size =  9797.43 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 262144
llama_context: n_ctx_per_seq = 262144
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =  8704.00 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size =  8192.00 MiB
llama_kv_cache_unified:      CUDA2 KV buffer size =  7680.00 MiB
llama_kv_cache_unified: size = 24576.00 MiB (262144 cells,  48 layers,  1/ 1 seqs), K (f16): 12288.00 MiB, V (f16): 12288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =  2408.01 MiB
llama_context:      CUDA1 compute buffer size =  1124.01 MiB
llama_context:      CUDA2 compute buffer size =  1340.77 MiB
llama_context:  CUDA_Host compute buffer size =  2052.02 MiB
llama_context: graph nodes  = 3079
llama_context: graph splits = 4
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 262144
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 262144
main: model loaded
main: chat template
...

@jessegross commented on GitHub (Aug 5, 2025):

@chhu There is a work-in-progress branch that may help with the issues you are running into. If you like, you can give it a try. You'll need to build from source and set the environment variables OLLAMA_NEW_ENGINE=1 and OLLAMA_NEW_ESTIMATES=1:
https://github.com/ollama/ollama/pull/11090
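
For readers who want to try that branch, a minimal sketch of what building and running it might look like follows. The PR number and the two environment variables come from the comment above; the clone, fetch, and build commands are assumptions about a standard from-source build, not steps confirmed in this thread.

```
# Minimal sketch of trying the PR #11090 branch (build steps are assumptions,
# not confirmed in this thread; GPU backends may additionally need the cmake
# steps from the repository's development docs).
git clone https://github.com/ollama/ollama.git
cd ollama
git fetch origin pull/11090/head:pr-11090    # GitHub publishes PR heads under refs/pull/<N>/head
git checkout pr-11090
go build .
OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 ./ollama serve
```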

@chhu commented on GitHub (Aug 6, 2025):

I can confirm this works!!

```
NAME                ID              SIZE     PROCESSOR    CONTEXT    UNTIL
qwen3_max:latest    c896132557cd    59 GB    100% GPU     262144     Forever
```

Please merge! I can imagine this is important for a lot of ppl wondering why they can't bump their ctx. 😄
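
As a hedged usage note (not something stated in the comment above): once the model reports 100% GPU with the full context, one way to actually use that window per request is the num_ctx option of the REST API. The model tag here is the custom tag from the listing above; whether 262144 tokens fit remains hardware-dependent.

```
# Hedged example: request the full 262144-token window for one generation.
# "qwen3_max:latest" is the custom model tag shown in the listing above.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3_max:latest",
  "prompt": "Summarize this issue in one sentence.",
  "options": { "num_ctx": 262144 }
}'
```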

Reference: github-starred/ollama#33332