[GH-ISSUE #11471] Flash Attention not supported? #33332

Closed
opened 2026-04-22 15:54:19 -05:00 by GiteaMirror · 9 comments

Originally created by @mirage335 on GitHub (Jul 18, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11471

What is the issue?

Laptop RTX 4090 (16 GB). It seems Ollama thinks flash attention is not supported on this GPU.

Yes, as the log shows, I am simultaneously trying to diagnose getting another LLM to use maximum VRAM when an eGPU is plugged in, while still keeping enough of an overhead buffer to at least work when the eGPU is not available.
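
For reference, the flash attention and KV cache settings visible in the server config line of the log below correspond to environment variables; a minimal PowerShell sketch with the values taken from this log (not a recommendation):

```shell
# PowerShell, per-session; values mirror the server config line in the log below
$env:OLLAMA_FLASH_ATTENTION = "true"   # request flash attention
$env:OLLAMA_KV_CACHE_TYPE   = "q8_0"   # quantized KV cache (only takes effect when flash attention is active)
$env:OLLAMA_GPU_OVERHEAD    = "0"      # reserved VRAM headroom, in bytes
$env:OLLAMA_SCHED_SPREAD    = "true"   # spread layers across all visible GPUs (the eGPU case)
ollama serve
```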

Relevant log output

time=2025-07-18T12:33:42.475-04:00 level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\mirag\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:true OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:true ROCR_VISIBLE_DEVICES:]"
time=2025-07-18T12:33:42.484-04:00 level=INFO source=images.go:476 msg="total blobs: 94"
time=2025-07-18T12:33:42.487-04:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-07-18T12:33:42.489-04:00 level=INFO source=routes.go:1288 msg="Listening on 127.0.0.1:11434 (version 0.9.7-rc0)"
time=2025-07-18T12:33:42.489-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-07-18T12:33:42.489-04:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-07-18T12:33:42.489-04:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-07-18T12:33:42.489-04:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=14 efficiency=8 threads=20
time=2025-07-18T12:33:42.607-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-9e774f15-e659-849b-aa06-4415cca19573 library=cuda variant=v12 compute=8.9 driver=12.9 name="NVIDIA GeForce RTX 4090 Laptop GPU" total="16.0 GiB" available="14.7 GiB"
time=2025-07-18T12:34:01.865-04:00 level=INFO source=server.go:135 msg="system memory" total="63.7 GiB" free="44.2 GiB" free_swap="110.9 GiB"
time=2025-07-18T12:34:01.885-04:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=999 layers.model=81 layers.offload=0 layers.split="" memory.available="[14.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="14.6 GiB" memory.required.partial="0 B" memory.required.kv="2.2 GiB" memory.required.allocations="[0 B]" memory.weights.total="12.4 GiB" memory.weights.repeating="11.7 GiB" memory.weights.nonrepeating="688.9 MiB" memory.graph.full="23.3 GiB" memory.graph.partial="23.3 GiB"
time=2025-07-18T12:34:01.885-04:00 level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-07-18T12:34:01.885-04:00 level=WARN source=server.go:229 msg="quantized kv cache requested but flash attention disabled" type=q8_0
llama_model_loader: loaded meta data with 38 key-value pairs and 569 tensors from C:\Users\mirag\.ollama\models\blobs\sha256-e8d0c0186ba2e3deb914d17a23caf8e21a5ec885ccf4ccb54101ccbcd95c8a36 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deci
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama_Nemotron_Super
llama_model_loader: - kv   3:                            general.version str              = v1
llama_model_loader: - kv   4:                           general.finetune str              = 3_3-Nemotron-Super
llama_model_loader: - kv   5:                           general.basename str              = Llama
llama_model_loader: - kv   6:                         general.size_label str              = 49B
llama_model_loader: - kv   7:                            general.license str              = other
llama_model_loader: - kv   8:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv   9:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv  10:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv  11:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  12:                        deci.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  13:               deci.attention.head_count_kv arr[i32,80]      = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, ...
llama_model_loader: - kv  14:                  deci.attention.head_count arr[i32,80]      = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64...
llama_model_loader: - kv  15:                   deci.feed_forward_length arr[i32,80]      = [14336, 28672, 28672, 28672, 28672, 2...
llama_model_loader: - kv  16:                           deci.block_count u32              = 80
llama_model_loader: - kv  17:                        deci.context_length u32              = 131072
llama_model_loader: - kv  18:                      deci.embedding_length u32              = 8192
llama_model_loader: - kv  19:      deci.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  20:                  deci.attention.key_length u32              = 128
llama_model_loader: - kv  21:                deci.attention.value_length u32              = 128
llama_model_loader: - kv  22:                            deci.vocab_size u32              = 128256
llama_model_loader: - kv  23:                  deci.rope.dimension_count u32              = 128
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{- bos_token }}{%- if messages[0]['r...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 19
llama_model_loader: - kv  34:                      quantize.imatrix.file str              = /models_out/Llama-3_3-Nemotron-Super-...
llama_model_loader: - kv  35:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  36:             quantize.imatrix.entries_count i32              = 436
llama_model_loader: - kv  37:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:  131 tensors
llama_model_loader: - type q2_K:   11 tensors
llama_model_loader: - type q4_K:   49 tensors
llama_model_loader: - type q5_K:    1 tensors
llama_model_loader: - type iq2_xxs:  377 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ2_XXS - 2.0625 bpw
print_info: file size   = 12.71 GiB (2.19 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = deci
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 49.87 B
print_info: general.name     = Llama_Nemotron_Super
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-07-18T12:34:02.883-04:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\mirag\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\mirag\\.ollama\\models\\blobs\\sha256-e8d0c0186ba2e3deb914d17a23caf8e21a5ec885ccf4ccb54101ccbcd95c8a36 --ctx-size 14336 --batch-size 512 --n-gpu-layers 999 --threads 14 --no-mmap --parallel 4 --port 57803"
time=2025-07-18T12:34:02.898-04:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-07-18T12:34:02.898-04:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-18T12:34:02.900-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
time=2025-07-18T12:34:02.932-04:00 level=INFO source=runner.go:815 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 Laptop GPU, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\mirag\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\mirag\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-07-18T12:34:03.017-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-07-18T12:34:03.018-04:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:57803"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090 Laptop GPU) - 15048 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 569 tensors from C:\Users\mirag\.ollama\models\blobs\sha256-e8d0c0186ba2e3deb914d17a23caf8e21a5ec885ccf4ccb54101ccbcd95c8a36 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deci
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama_Nemotron_Super
llama_model_loader: - kv   3:                            general.version str              = v1
llama_model_loader: - kv   4:                           general.finetune str              = 3_3-Nemotron-Super
llama_model_loader: - kv   5:                           general.basename str              = Llama
llama_model_loader: - kv   6:                         general.size_label str              = 49B
llama_model_loader: - kv   7:                            general.license str              = other
llama_model_loader: - kv   8:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv   9:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv  10:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv  11:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  12:                        deci.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  13:               deci.attention.head_count_kv arr[i32,80]      = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, ...
llama_model_loader: - kv  14:                  deci.attention.head_count arr[i32,80]      = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64...
llama_model_loader: - kv  15:                   deci.feed_forward_length arr[i32,80]      = [14336, 28672, 28672, 28672, 28672, 2...
llama_model_loader: - kv  16:                           deci.block_count u32              = 80
llama_model_loader: - kv  17:                        deci.context_length u32              = 131072
llama_model_loader: - kv  18:                      deci.embedding_length u32              = 8192
llama_model_loader: - kv  19:      deci.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  20:                  deci.attention.key_length u32              = 128
llama_model_loader: - kv  21:                deci.attention.value_length u32              = 128
llama_model_loader: - kv  22:                            deci.vocab_size u32              = 128256
llama_model_loader: - kv  23:                  deci.rope.dimension_count u32              = 128
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = llama-bpe
time=2025-07-18T12:34:03.152-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{- bos_token }}{%- if messages[0]['r...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 19
llama_model_loader: - kv  34:                      quantize.imatrix.file str              = /models_out/Llama-3_3-Nemotron-Super-...
llama_model_loader: - kv  35:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  36:             quantize.imatrix.entries_count i32              = 436
llama_model_loader: - kv  37:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:  131 tensors
llama_model_loader: - type q2_K:   11 tensors
llama_model_loader: - type q4_K:   49 tensors
llama_model_loader: - type q5_K:    1 tensors
llama_model_loader: - type iq2_xxs:  377 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ2_XXS - 2.0625 bpw
print_info: file size   = 12.71 GiB (2.19 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = deci
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 8192
print_info: n_layer          = 80
print_info: n_head           = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64, 64, 0, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 64, 64, 64, 64, 64, 64, 64, 64]
print_info: n_head_kv        = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8]
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8]
print_info: n_embd_k_gqa     = [1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
print_info: n_embd_v_gqa     = [1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = [14336, 28672, 28672, 28672, 28672, 28672, 14336, 14336, 28672, 28672, 28672, 17920, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 7168, 14336, 14336, 7168, 28672, 7168, 14336, 7168, 7168, 7168, 28672, 7168, 5632, 5632, 7168, 5632, 5632, 5632, 7168, 7168, 2816, 2816, 5632, 5632, 2816, 2816, 5632, 2816, 2816, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672]
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 70B
print_info: model params     = 49.87 B
print_info: general.name     = Llama_Nemotron_Super
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors:        CUDA0 model buffer size = 12690.13 MiB
load_tensors:          CPU model buffer size =   328.78 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 14336
llama_context: n_ctx_per_seq = 3584
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (3584) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     2.08 MiB
llama_kv_cache_unified: kv_size = 14336, type_k = 'f16', type_v = 'f16', n_layer = 80, can_shift = 1, padding = 32
CUDA error: out of memory
  current device: 0, in function ggml_backend_cuda_buffer_clear at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:622
  cudaDeviceSynchronize()
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:76: CUDA error
time=2025-07-18T12:34:07.112-04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-18T12:34:07.225-04:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 0xc0000409"
time=2025-07-18T12:34:07.363-04:00 level=ERROR source=sched.go:487 msg="error loading llama server" error="llama runner process has terminated: CUDA error"
[GIN] 2025/07/18 - 12:34:07 | 500 |    5.9167696s |       127.0.0.1 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.9.7-rc0

GiteaMirror added the bug label 2026-04-22 15:54:19 -05:00

@mirage335 commented on GitHub (Jul 18, 2025):

From the logs of other experiments, it looks like this may be specific to Llama-3_3-Nemotron-Super-49B-v1 and other models of that series.
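
A quick way to check which architecture a local model reports is `ollama show`; the model tag below is illustrative, and models in this family report `deci`:

```shell
# Print model metadata; the architecture field for this family shows "deci" (model tag illustrative)
ollama show nemotron-super:49b
```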

@rick-github commented on GitHub (Jul 19, 2025):

time=2025-07-18T12:34:01.885-04:00 level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"

Usually means that the drivers are too old.
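
One quick way to confirm the installed driver and compute capability on the affected machine:

```shell
# Report GPU name, driver version, and compute capability
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv
```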

@aaronnewsome commented on GitHub (Jul 29, 2025):

Using the latest rc2 docker container, I had the same issue running Llama-3_3-Nemotron-Super-49B-v1: Ollama said that flash attention was not supported, even though it should be. I couldn't figure out what the issue was, so I ended up using llama.cpp to run this specific model, but Ollama runs all the rest.

Now I see rc3 is out, so maybe there's a fix in there. I'm downloading it now ...
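
For anyone testing this through Docker, pinning the release-candidate tag keeps the comparison reproducible; a sketch, with an illustrative tag (check the published tags for the actual name):

```shell
# Pull and run a pinned release-candidate image (tag illustrative)
docker pull ollama/ollama:0.10.0-rc3
docker run -d --gpus=all -p 11434:11434 -v ollama:/root/.ollama \
  -e OLLAMA_FLASH_ATTENTION=true --name ollama ollama/ollama:0.10.0-rc3
```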

@dhiltgen commented on GitHub (Jul 31, 2025):

@mirage335 can you share a server log with `$env:OLLAMA_DEBUG="1"` set so we can see some more details around GPU discovery and scheduling? This might be related to #11614
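
A minimal way to capture that from a PowerShell session, assuming the background service/tray app is stopped so `ollama serve` runs in the foreground:

```shell
# PowerShell: enable debug logging and tee the server output to a file
$env:OLLAMA_DEBUG = "1"
ollama serve 2>&1 | Tee-Object -FilePath ollama-debug.log
```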

@chhu commented on GitHub (Aug 1, 2025):

Same problem on our Linux RTX 3090 setup; llama.cpp compiled from source works. BTW, the new Qwen3 30B runs much faster with llama.cpp: I get 90 t/s eval with llama.cpp (no FA) vs ~60 t/s eval with Ollama 0.10.1.

nvidia-smi
Fri Aug  1 13:18:28 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
| 30%   35C    P8             15W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:25:00.0 Off |                  N/A |
| 30%   35C    P8             16W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off |   00000000:41:00.0 Off |                  N/A |
| 30%   32C    P8             26W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off |   00000000:61:00.0 Off |                  N/A |
| 30%   30C    P8             23W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 3090        Off |   00000000:81:00.0 Off |                  N/A |
| 30%   34C    P8             23W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 3090        Off |   00000000:A1:00.0 Off |                  N/A |
| 30%   32C    P8             17W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 3090        Off |   00000000:C1:00.0 Off |                  N/A |
| 30%   32C    P8             21W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 3090        Off |   00000000:E1:00.0 Off |                  N/A |
| 30%   30C    P8             21W /  350W |       2MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

cut from ollama serve w/ debug and FA on:
initializing /lib64/libcuda.so.570.124.06
dlsym: cuInit - 0x7faf97e4ae70
dlsym: cuDriverGetVersion - 0x7faf97e4ae90
dlsym: cuDeviceGetCount - 0x7faf97e4aed0
dlsym: cuDeviceGet - 0x7faf97e4aeb0
dlsym: cuDeviceGetAttribute - 0x7faf97e4afb0
dlsym: cuDeviceGetUuid - 0x7faf97e4af10
dlsym: cuDeviceGetName - 0x7faf97e4aef0
dlsym: cuCtxCreate_v3 - 0x7faf97e4b190
dlsym: cuMemGetInfo_v2 - 0x7faf97e4b910
dlsym: cuCtxDestroy - 0x7faf97ea9ab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 8
time=2025-08-01T13:20:05.372+02:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=8 library=/lib64/libcuda.so.570.124.06
[GPU-1068f375-e3fe-422f-f71b-73a2394e701f] CUDA totalMem 24135mb
[GPU-1068f375-e3fe-422f-f71b-73a2394e701f] CUDA freeMem 23871mb
[GPU-1068f375-e3fe-422f-f71b-73a2394e701f] Compute Capability 8.6
[GPU-615262d3-7dfe-666a-cf37-566e760e4ed1] CUDA totalMem 24135mb
[GPU-615262d3-7dfe-666a-cf37-566e760e4ed1] CUDA freeMem 23871mb
[GPU-615262d3-7dfe-666a-cf37-566e760e4ed1] Compute Capability 8.6
[GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c] CUDA totalMem 24135mb
[GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c] CUDA freeMem 23871mb
[GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c] Compute Capability 8.6
[GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036] CUDA totalMem 24135mb
[GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036] CUDA freeMem 23871mb
[GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036] Compute Capability 8.6
[GPU-02de372e-3119-89d2-b10e-16ce6064e6e0] CUDA totalMem 24135mb
[GPU-02de372e-3119-89d2-b10e-16ce6064e6e0] CUDA freeMem 23871mb
[GPU-02de372e-3119-89d2-b10e-16ce6064e6e0] Compute Capability 8.6
[GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5] CUDA totalMem 24135mb
[GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5] CUDA freeMem 23871mb
[GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5] Compute Capability 8.6
[GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3] CUDA totalMem 24135mb
[GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3] CUDA freeMem 23871mb
[GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3] Compute Capability 8.6
[GPU-b355efbb-1628-3337-ee30-11313755b901] CUDA totalMem 24135mb
[GPU-b355efbb-1628-3337-ee30-11313755b901] CUDA freeMem 23871mb
[GPU-b355efbb-1628-3337-ee30-11313755b901] Compute Capability 8.6
time=2025-08-01T13:20:07.273+02:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-1068f375-e3fe-422f-f71b-73a2394e701f library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-615262d3-7dfe-666a-cf37-566e760e4ed1 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-02de372e-3119-89d2-b10e-16ce6064e6e0 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-08-01T13:20:07.274+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-b355efbb-1628-3337-ee30-11313755b901 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"

time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=memory.go:201 msg="gpu has too little memory to allocate any layers" id=GPU-615262d3-7dfe-666a-cf37-566e760e4ed1 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.1 GiB" gpu_zer_overhead="0 B" partial_offload="31.2 GiB" full_offload="31.2 GiB"
time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=memory.go:351 msg="insufficient VRAM to load any model layers"
time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3moe.vision.block_count default=0
time=2025-08-01T13:21:41.115+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.2 GiB" before.free="481.0 GiB" before.free_swap="0 B" now.total="503.2 GiB" now.free="481.0 GiB" now.free_swap="0 B"
initializing /lib64/libcuda.so.570.124.06
dlsym: cuInit - 0x7faf97e4ae70
dlsym: cuDriverGetVersion - 0x7faf97e4ae90
dlsym: cuDeviceGetCount - 0x7faf97e4aed0
dlsym: cuDeviceGet - 0x7faf97e4aeb0
dlsym: cuDeviceGetAttribute - 0x7faf97e4afb0
dlsym: cuDeviceGetUuid - 0x7faf97e4af10
dlsym: cuDeviceGetName - 0x7faf97e4aef0
dlsym: cuCtxCreate_v3 - 0x7faf97e4b190
dlsym: cuMemGetInfo_v2 - 0x7faf97e4b910
dlsym: cuCtxDestroy - 0x7faf97ea9ab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 8
time=2025-08-01T13:21:41.356+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-1068f375-e3fe-422f-f71b-73a2394e701f name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:41.623+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-615262d3-7dfe-666a-cf37-566e760e4ed1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:41.858+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:42.093+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:42.329+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-02de372e-3119-89d2-b10e-16ce6064e6e0 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:42.570+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:42.812+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-01T13:21:43.049+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-b355efbb-1628-3337-ee30-11313755b901 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
releasing cuda driver library
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=memory.go:201 msg="gpu has too little memory to allocate any layers" id=GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB" minimum_memory=479199232 layer_size="1.1 GiB" gpu_zer_overhead="0 B" partial_offload="31.2 GiB" full_offload="31.2 GiB"
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=memory.go:351 msg="insufficient VRAM to load any model layers"
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[23.3 GiB]"
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3moe.vision.block_count default=0
time=2025-08-01T13:21:43.050+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.2 GiB" before.free="481.0 GiB" before.free_swap="0 B" now.total="503.2 GiB" now.free="481.0 GiB" now.free_swap="0 B"
initializing /lib64/libcuda.so.570.124.06
dlsym: cuInit - 0x7faf97e4ae70
dlsym: cuDriverGetVersion - 0x7faf97e4ae90
dlsym: cuDeviceGetCount - 0x7faf97e4aed0
dlsym: cuDeviceGet - 0x7faf97e4aeb0
dlsym: cuDeviceGetAttribute - 0x7faf97e4afb0
dlsym: cuDeviceGetUuid - 0x7faf97e4af10
dlsym: cuDeviceGetName - 0x7faf97e4aef0
dlsym: cuCtxCreate_v3 - 0x7faf97e4b190
dlsym: cuMemGetInfo_v2 - 0x7faf97e4b910
dlsym: cuCtxDestroy - 0x7faf97ea9ab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 8
... repeating...

time=2025-08-01T13:21:58.234+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=0 layers.split="" memory.available="[23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="53.4 GiB" memory.required.partial="0 B" memory.required.kv="23.4 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B 0 B 0 B 0 B 0 B]" memory.weights.total="29.9 GiB" memory.weights.repeating="29.6 GiB" memory.weights.nonrepeating="315.3 MiB" memory.graph.full="31.2 GiB" memory.graph.partial="31.2 GiB"
time=2025-08-01T13:21:58.234+02:00 level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-08-01T13:21:58.234+02:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]

... then everything ends up in CPU memory. llama.cpp also does not manage to fit it all into the GPUs with FA=off, but once FA=on there seems to be plenty of room. The new Qwen3 30B w/ 262k context (almost maxed out) gives me ~7 t/s compared to the 90 t/s with an almost empty context. ctk and ctv were not touched (f16 default).
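
For a sense of scale, here is a minimal sketch of the f16 KV-cache arithmetic for this model, using the qwen3moe figures from the print_info dumps further down in the thread (n_layer = 48, n_embd_k_gqa = n_embd_v_gqa = 512); the helper name is only for illustration:

```python
# Rough f16 KV-cache size for Qwen3 30B A3B, using values from the
# print_info dumps below: n_layer=48, n_embd_k_gqa=512, n_embd_v_gqa=512,
# f16 cache = 2 bytes per element.
def kv_cache_mib(n_ctx, n_layer=48, kv_dims=512 + 512, bytes_per_elem=2):
    """MiB = n_ctx * n_layer * (K dims + V dims) * bytes per element / 2**20."""
    return n_ctx * n_layer * kv_dims * bytes_per_elem / 2**20

print(kv_cache_mib(160_000))   # 15000.0 MiB -- matches the "KV self size" in the runner log below
print(kv_cache_mib(262_144))   # 24576.0 MiB -- matches the llama.cpp run below; ~24 GiB for the cache alone
```

So at the full 262k window the f16 KV cache by itself already exceeds a single 24 GB card, before any weights or compute buffers are counted.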


@chhu commented on GitHub (Aug 4, 2025):

Wow, I just realized this is more serious than expected. I always thought of FA as a speed-up-only feature, but it turns out it also saves a lot of precious VRAM (or allows much bigger ctx windows). I will try to compile from source and see if that solves this, please fix!!
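
To put a rough number on the VRAM side of that: as I understand it, ollama only applies a quantized KV cache (OLLAMA_KV_CACHE_TYPE, e.g. q8_0) when flash attention is active, so FA indirectly unlocks a further cache reduction. A hedged sketch, assuming q8_0 costs roughly 8.5 bits per element (the same ~8.5 BPW that shows up for the q8_0 weights in the dumps below):

```python
# Rough f16 vs q8_0 KV-cache size at full context for this model
# (n_layer=48, K+V dims per layer = 1024).  q8_0 is assumed to cost
# ~8.5 bits per element (blocks of 32 int8 values plus one f16 scale).
def kv_cache_gib(n_ctx, bits_per_elem, n_layer=48, kv_dims=1024):
    return n_ctx * n_layer * kv_dims * bits_per_elem / 8 / 2**30

n_ctx = 262_144
print(f"f16 : {kv_cache_gib(n_ctx, 16):.1f} GiB")   # ~24.0 GiB
print(f"q8_0: {kv_cache_gib(n_ctx, 8.5):.1f} GiB")  # ~12.8 GiB
```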


@chhu commented on GitHub (Aug 4, 2025):

Got it working by compiling manually, but ollama is still offloading to CPU. llama.cpp manages qwen3_30BA3B-Q8 with the full ctx window of 262,144 on 3 24GB GPUs. With ollama I use all 8 GPUs and still get heavy CPU offload despite having FA:

initializing /usr/lib64/libcuda.so.570.124.06
dlsym: cuInit - 0x7f2f9be4ae70
dlsym: cuDriverGetVersion - 0x7f2f9be4ae90
dlsym: cuDeviceGetCount - 0x7f2f9be4aed0
dlsym: cuDeviceGet - 0x7f2f9be4aeb0
dlsym: cuDeviceGetAttribute - 0x7f2f9be4afb0
dlsym: cuDeviceGetUuid - 0x7f2f9be4af10
dlsym: cuDeviceGetName - 0x7f2f9be4aef0
dlsym: cuCtxCreate_v3 - 0x7f2f9be4b190
dlsym: cuMemGetInfo_v2 - 0x7f2f9be4b910
dlsym: cuCtxDestroy - 0x7f2f9bea9ab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 8
time=2025-08-04T16:45:48.925+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-1068f375-e3fe-422f-f71b-73a2394e701f name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:49.149+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-615262d3-7dfe-666a-cf37-566e760e4ed1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:49.377+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:49.610+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:49.842+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-02de372e-3119-89d2-b10e-16ce6064e6e0 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:50.071+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:50.294+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
time=2025-08-04T16:45:50.522+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-b355efbb-1628-3337-ee30-11313755b901 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="23.6 GiB" before.free="23.3 GiB" now.total="23.6 GiB" now.free="23.3 GiB" now.used="263.3 MiB"
releasing cuda driver library
time=2025-08-04T16:45:50.523+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=16 layers.split=2,2,2,2,2,2,2,2 memory.available="[23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="211.8 GiB" memory.required.partial="182.0 GiB" memory.required.kv="14.6 GiB" memory.required.allocations="[22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB 22.7 GiB]" memory.weights.total="29.9 GiB" memory.weights.repeating="29.6 GiB" memory.weights.nonrepeating="315.3 MiB" memory.graph.full="19.5 GiB" memory.graph.partial="19.5 GiB"
time=2025-08-04T16:45:50.523+02:00 level=INFO source=server.go:218 msg="enabling flash attention"
time=2025-08-04T16:45:50.523+02:00 level=WARN source=server.go:226 msg="kv cache type not supported by model" type=""
time=2025-08-04T16:45:50.523+02:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
llama_model_loader: loaded meta data with 33 key-value pairs and 579 tensors from /scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                           general.basename str              = Qwen3
llama_model_loader: - kv   2:                          general.file_type u32              = 7
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                               general.name str              = Qwen3 30B A3B Instruct 2507
llama_model_loader: - kv   5:                    general.parameter_count u64              = 30532122624
llama_model_loader: - kv   6:               general.quantization_version u32              = 2
llama_model_loader: - kv   7:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   8:                               general.type str              = model
llama_model_loader: - kv   9:                            general.version str              = 2507
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  18:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  19:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  20:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  21:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  22:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q8_0:  338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.25 GiB (8.51 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B Instruct 2507
print_info: n_ff_exp         = 0
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-08-04T16:45:50.760+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/scr2/new/ollama/bin/ollama runner --model /scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 --ctx-size 160000 --batch-size 512 --n-gpu-layers 16 --threads 64 --flash-attn --parallel 1 --tensor-split 2,2,2,2,2,2,2,2 --port 46783"
time=2025-08-04T16:45:50.760+02:00 level=DEBUG source=server.go:439 msg=subprocess LD_LIBRARY_PATH=/scr2/new/ollama/lib/ollama:/vicsys/utils_rh8/cuda/lib64:/plp_scr1/utils/mpi/lib:/plp_scr1/utils/mpi_5/ucx_build/lib:/vicsys/utils_rh8/gcc/lib64:/vicsys/utils_rh8/gcc-11.5.0/lib64:/vicsys/utils_rh8/gcc-11.5.0/lib64:/vicsys/utils_rh8/gcc-11.5.0/lib64:/vicsys/utils_rh8/cuda/lib64:/plp_scr1/utils/mpi/lib:/plp_scr1/utils/mpi_5/ucx_build/lib:/vicsys/utils_rh8/gcc/lib64:/plp_scr1/utils/mpi/lib:/plp_scr1/utils/mpi_5/ucx_build/lib:/vicsys/utils_rh8/gcc/lib64::/vicsys/utils_rh8/cuda/lib64:/scr2/new/ollama/lib/ollama CUDA_BASE=/vicsys/utils_rh8/cuda CUDA_PATH=/vicsys/utils_rh8/cuda OLLAMA_HOST=0.0.0.0 OLLAMA_NUM_PARALLEL=1 OLLAMA_KEEP_ALIVE=-1 OLLAMA_FLASH_ATTENTION=1 CUDA_HOME=/vicsys/utils_rh8/cuda OLLAMA_ORIGINS=* OLLAMA_DEBUG=1 PATH=/vicsys/utils_rh8/paraview/bin:/vicsys/utils_rh8/cmake/bin:/plp_scr1/utils/mpi/bin:/plp_scr1/utils/mpi_5/ucx_build/bin:/vicsys/utils_rh8/gcc/bin:/home/huettig/bin:/home/huettig/bin_extra:/home/huettig/.local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/vicsys/utils_rh8/cuda/bin:/plp_scr1/utils/numa/bin:/vicsys/utils_rh8/node/bin OLLAMA_MODELS=/scr2/new/ollama/models/models OLLAMA_MAX_LOADED_MODELS=24 OLLAMA_LIBRARY_PATH=/scr2/new/ollama/lib/ollama CUDA_VISIBLE_DEVICES=GPU-1068f375-e3fe-422f-f71b-73a2394e701f,GPU-615262d3-7dfe-666a-cf37-566e760e4ed1,GPU-f22fdea8-5405-3129-d32a-7f916d1f8b7c,GPU-7f0f16bf-bb19-c951-3cab-78c75e8f1036,GPU-02de372e-3119-89d2-b10e-16ce6064e6e0,GPU-80159554-dd91-f6ab-b8a5-bbf5a34c59c5,GPU-9190655b-d4a4-131c-ae18-7b58cce3b7f3,GPU-b355efbb-1628-3337-ee30-11313755b901
time=2025-08-04T16:45:50.761+02:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-04T16:45:50.761+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-04T16:45:50.761+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-04T16:45:50.786+02:00 level=INFO source=runner.go:815 msg="starting go runner"
time=2025-08-04T16:45:50.786+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/scr2/new/ollama/lib/ollama
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 6: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 7: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /scr2/new/ollama/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /scr2/new/ollama/lib/ollama/libggml-cpu-haswell.so
time=2025-08-04T16:45:52.012+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=860 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=860 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=860 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=860 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=860 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=860 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 CUDA.6.ARCHS=860 CUDA.6.USE_GRAPHS=1 CUDA.6.PEER_MAX_BATCH_SIZE=128 CUDA.7.ARCHS=860 CUDA.7.USE_GRAPHS=1 CUDA.7.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-04T16:45:52.013+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:46783"
time=2025-08-04T16:45:52.016+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA4 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA5 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA6 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA7 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 579 tensors from /scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                           general.basename str              = Qwen3
llama_model_loader: - kv   2:                          general.file_type u32              = 7
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                               general.name str              = Qwen3 30B A3B Instruct 2507
llama_model_loader: - kv   5:                    general.parameter_count u64              = 30532122624
llama_model_loader: - kv   6:               general.quantization_version u32              = 2
llama_model_loader: - kv   7:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   8:                               general.type str              = model
llama_model_loader: - kv   9:                            general.version str              = 2507
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  18:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  19:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  20:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  21:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  22:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q8_0:  338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.25 GiB (8.51 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 30B.A3B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B Instruct 2507
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: layer  27 assigned to device CPU, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
load_tensors: layer  29 assigned to device CPU, is_swa = 0
load_tensors: layer  30 assigned to device CPU, is_swa = 0
load_tensors: layer  31 assigned to device CPU, is_swa = 0
load_tensors: layer  32 assigned to device CUDA0, is_swa = 0
load_tensors: layer  33 assigned to device CUDA0, is_swa = 0
load_tensors: layer  34 assigned to device CUDA1, is_swa = 0
load_tensors: layer  35 assigned to device CUDA1, is_swa = 0
load_tensors: layer  36 assigned to device CUDA2, is_swa = 0
load_tensors: layer  37 assigned to device CUDA2, is_swa = 0
load_tensors: layer  38 assigned to device CUDA3, is_swa = 0
load_tensors: layer  39 assigned to device CUDA3, is_swa = 0
load_tensors: layer  40 assigned to device CUDA4, is_swa = 0
load_tensors: layer  41 assigned to device CUDA4, is_swa = 0
load_tensors: layer  42 assigned to device CUDA5, is_swa = 0
load_tensors: layer  43 assigned to device CUDA5, is_swa = 0
load_tensors: layer  44 assigned to device CUDA6, is_swa = 0
load_tensors: layer  45 assigned to device CUDA6, is_swa = 0
load_tensors: layer  46 assigned to device CUDA7, is_swa = 0
load_tensors: layer  47 assigned to device CUDA7, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q8_0) (and 386 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 16 repeating layers to GPU
load_tensors: offloaded 16/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  1264.28 MiB
load_tensors:        CUDA1 model buffer size =  1264.28 MiB
load_tensors:        CUDA2 model buffer size =  1264.28 MiB
load_tensors:        CUDA3 model buffer size =  1264.28 MiB
load_tensors:        CUDA4 model buffer size =  1264.28 MiB
load_tensors:        CUDA5 model buffer size =  1264.28 MiB
load_tensors:        CUDA6 model buffer size =  1264.28 MiB
load_tensors:        CUDA7 model buffer size =  1264.28 MiB
load_tensors:   CPU_Mapped model buffer size = 30973.40 MiB
time=2025-08-04T16:45:55.780+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.08"
time=2025-08-04T16:45:56.031+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.15"
time=2025-08-04T16:45:56.282+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.23"
time=2025-08-04T16:45:56.533+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.31"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 160000
llama_context: n_ctx_per_seq = 160000
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (160000) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.59 MiB
create_memory: n_ctx = 160000 (padded)
llama_kv_cache_unified: kv_size = 160000, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 256
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: dev = CPU
llama_kv_cache_unified: layer  23: dev = CPU
llama_kv_cache_unified: layer  24: dev = CPU
llama_kv_cache_unified: layer  25: dev = CPU
llama_kv_cache_unified: layer  26: dev = CPU
llama_kv_cache_unified: layer  27: dev = CPU
llama_kv_cache_unified: layer  28: dev = CPU
llama_kv_cache_unified: layer  29: dev = CPU
llama_kv_cache_unified: layer  30: dev = CPU
llama_kv_cache_unified: layer  31: dev = CPU
llama_kv_cache_unified: layer  32: dev = CUDA0
llama_kv_cache_unified: layer  33: dev = CUDA0
llama_kv_cache_unified: layer  34: dev = CUDA1
llama_kv_cache_unified: layer  35: dev = CUDA1
llama_kv_cache_unified: layer  36: dev = CUDA2
llama_kv_cache_unified: layer  37: dev = CUDA2
llama_kv_cache_unified: layer  38: dev = CUDA3
llama_kv_cache_unified: layer  39: dev = CUDA3
llama_kv_cache_unified: layer  40: dev = CUDA4
llama_kv_cache_unified: layer  41: dev = CUDA4
llama_kv_cache_unified: layer  42: dev = CUDA5
llama_kv_cache_unified: layer  43: dev = CUDA5
llama_kv_cache_unified: layer  44: dev = CUDA6
llama_kv_cache_unified: layer  45: dev = CUDA6
llama_kv_cache_unified: layer  46: dev = CUDA7
llama_kv_cache_unified: layer  47: dev = CUDA7
llama_kv_cache_unified:      CUDA0 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA2 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA3 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA4 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA5 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA6 KV buffer size =   625.00 MiB
llama_kv_cache_unified:      CUDA7 KV buffer size =   625.00 MiB
time=2025-08-04T16:45:56.784+02:00 level=DEBUG source=server.go:643 msg="model load progress 1.00"
time=2025-08-04T16:45:57.035+02:00 level=DEBUG source=server.go:646 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_unified:        CPU KV buffer size = 10000.00 MiB
llama_kv_cache_unified: KV self size  = 15000.00 MiB, K (f16): 7500.00 MiB, V (f16): 7500.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 9
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =   797.25 MiB
llama_context:      CUDA1 compute buffer size =   208.75 MiB
llama_context:      CUDA2 compute buffer size =   208.75 MiB
llama_context:      CUDA3 compute buffer size =   208.75 MiB
llama_context:      CUDA4 compute buffer size =   208.75 MiB
llama_context:      CUDA5 compute buffer size =   208.75 MiB
llama_context:      CUDA6 compute buffer size =   208.75 MiB
llama_context:      CUDA7 compute buffer size =   208.75 MiB
llama_context:  CUDA_Host compute buffer size =   316.51 MiB
llama_context: graph nodes  = 2935
llama_context: graph splits = 459 (with bs=512), 74 (with bs=1)
time=2025-08-04T16:46:00.798+02:00 level=INFO source=server.go:637 msg="llama runner started in 10.04 seconds"
time=2025-08-04T16:46:00.798+02:00 level=DEBUG source=sched.go:493 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:30b-a3b-instruct-2507-q8_0 runner.inference=cuda runner.devices=8 runner.size="211.8 GiB" runner.vram="182.0 GiB" runner.parallel=1 runner.pid=3393640 runner.model=/scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 runner.num_ctx=160000
time=2025-08-04T16:46:00.806+02:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=264 format=""
time=2025-08-04T16:46:00.813+02:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=59 used=0 remaining=59
time=2025-08-04T16:46:12.671+02:00 level=DEBUG source=sched.go:501 msg="context for request finished"
time=2025-08-04T16:46:12.672+02:00 level=DEBUG source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3:30b-a3b-instruct-2507-q8_0 runner.inference=cuda runner.devices=8 runner.size="211.8 GiB" runner.vram="182.0 GiB" runner.parallel=1 runner.pid=3393640 runner.model=/scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 runner.num_ctx=160000 duration=2562047h47m16.854775807s
time=2025-08-04T16:46:12.672+02:00 level=DEBUG source=sched.go:359 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:30b-a3b-instruct-2507-q8_0 runner.inference=cuda runner.devices=8 runner.size="211.8 GiB" runner.vram="182.0 GiB" runner.parallel=1 runner.pid=3393640 runner.model=/scr2/new/ollama/models/models/blobs/sha256-da17ea1e24268d63d8c5f876acc6a428a389f45e3d0d016ebdaaf3e1aed81c32 runner.num_ctx=160000 refCount=0
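
The per-device numbers in the runner log above are consistent with the 16-layer offload: 2 KV layers per GPU and 32 layers left on the CPU. A quick sanity check with the same formula as before (names are illustrative):

```python
# KV buffer per device for the run above: n_ctx=160000, f16 cache,
# 1024 K+V dims per layer, 2 layers per GPU, 32 layers on the CPU.
def kv_buffer_mib(n_layers, n_ctx=160_000, kv_dims=1024, bytes_per_elem=2):
    return n_ctx * n_layers * kv_dims * bytes_per_elem / 2**20

print(kv_buffer_mib(2))    #   625.0 MiB per GPU -- matches the CUDA0..CUDA7 KV buffer lines above
print(kv_buffer_mib(32))   # 10000.0 MiB on CPU  -- matches the CPU KV buffer size above
```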

Here is the llama.cpp output, restricted to 3 GPUs:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 6056 (baad9488) with cc (GCC) 15.1.0 for x86_64-pc-linux-gnu
system info: n_threads = 64, n_threads_batch = 64, total_threads = 128

system_info: n_threads = 64 (n_threads_batch = 64) / 128 | CUDA : ARCHS = 860 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 127
main: loading model
srv    load_model: loading model 'models/qwen3-30.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 579 tensors from models/qwen3-30.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                           general.basename str              = Qwen3
llama_model_loader: - kv   2:                          general.file_type u32              = 7
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                               general.name str              = Qwen3 30B A3B Instruct 2507
llama_model_loader: - kv   5:                    general.parameter_count u64              = 30532122624
llama_model_loader: - kv   6:               general.quantization_version u32              = 2
llama_model_loader: - kv   7:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   8:                               general.type str              = model
llama_model_loader: - kv   9:                            general.version str              = 2507
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  18:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  19:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  20:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  21:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  22:                    qwen3moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q8_0:  338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.25 GiB (8.51 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: model type       = 30B.A3B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B Instruct 2507
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   315.30 MiB
load_tensors:        CUDA0 model buffer size = 10746.41 MiB
load_tensors:        CUDA1 model buffer size = 10114.27 MiB
load_tensors:        CUDA2 model buffer size =  9797.43 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 262144
llama_context: n_ctx_per_seq = 262144
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =  8704.00 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size =  8192.00 MiB
llama_kv_cache_unified:      CUDA2 KV buffer size =  7680.00 MiB
llama_kv_cache_unified: size = 24576.00 MiB (262144 cells,  48 layers,  1/ 1 seqs), K (f16): 12288.00 MiB, V (f16): 12288.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =  2408.01 MiB
llama_context:      CUDA1 compute buffer size =  1124.01 MiB
llama_context:      CUDA2 compute buffer size =  1340.77 MiB
llama_context:  CUDA_Host compute buffer size =  2052.02 MiB
llama_context: graph nodes  = 3079
llama_context: graph splits = 4
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 262144
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 262144
main: model loaded
main: chat template
...

@jessegross commented on GitHub (Aug 5, 2025):

@chhu There is a work-in-progress branch that may help with the issues you are running into. If you like, you can give it a try. You'll need to build from source and set the environment variables OLLAMA_NEW_ENGINE=1 and OLLAMA_NEW_ESTIMATES=1:
https://github.com/ollama/ollama/pull/11090
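
For readers who want to try that branch, a minimal sketch of what building and running it might look like follows. The PR number and the two environment variables come from the comment above; the clone, fetch, and build commands are assumptions about a standard from-source build, not steps confirmed in this thread.

```
# Minimal sketch of trying the PR #11090 branch (build steps are assumptions,
# not confirmed in this thread; GPU backends may additionally need the cmake
# steps from the repository's development docs).
git clone https://github.com/ollama/ollama.git
cd ollama
git fetch origin pull/11090/head:pr-11090    # GitHub publishes PR heads under refs/pull/<N>/head
git checkout pr-11090
go build .
OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 ./ollama serve
```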

@chhu commented on GitHub (Aug 6, 2025):

I can confirm this works!!

```
NAME                ID              SIZE     PROCESSOR    CONTEXT    UNTIL
qwen3_max:latest    c896132557cd    59 GB    100% GPU     262144     Forever
```

Please merge! I can imagine this is important for a lot of ppl wondering why they can't bump their ctx. 😄
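
As a hedged usage note (not something stated in the comment above): once the model reports 100% GPU with the full context, one way to actually use that window per request is the num_ctx option of the REST API. The model tag here is the custom tag from the listing above; whether 262144 tokens fit remains hardware-dependent.

```
# Hedged example: request the full 262144-token window for one generation.
# "qwen3_max:latest" is the custom model tag shown in the listing above.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3_max:latest",
  "prompt": "Summarize this issue in one sentence.",
  "options": { "num_ctx": 262144 }
}'
```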

Reference: github-starred/ollama#33332