[GH-ISSUE #10984] Ollama not using my gpu properly #53754

Closed
opened 2026-04-29 04:40:43 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @buekera on GitHub (Jun 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10984

What is the issue?

Hi there! I just started using ollama today and noticed a strange behavior.

Despite ollama telling me (ollama ps) that the processor is 100% GPU, my GPU isn't doing anything and no load is offloaded to it.

ollama ps
NAME                              ID              SIZE     PROCESSOR    UNTIL              
deepseek-r1:8b-0528-qwen3-fp16    a5f168330005    18 GB    100% GPU     4 minutes from now    
nvidia-smi
Thu Jun  5 17:21:57 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.153.02             Driver Version: 570.153.02     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:09:00.0  On |                  N/A |
| 30%   37C    P0             44W /  575W |    2881MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1052      G   /usr/lib/Xorg                            81MiB |
|    0   N/A  N/A            1192      G   /usr/bin/kwin_wayland                   108MiB |
|    0   N/A  N/A            1288      G   /usr/bin/Xwayland                        12MiB |
|    0   N/A  N/A            1356    C+G   /usr/bin/plasmashell                    880MiB |
|    0   N/A  N/A            1610      G   /usr/lib/firefox/firefox                674MiB |
|    0   N/A  N/A            1660    C+G   /usr/bin/plasma-systemmonitor            82MiB |
|    0   N/A  N/A            1880      G   ...ing --variations-seed-version        129MiB |
|    0   N/A  N/A            1883      G   ...per --variations-seed-version        445MiB |
+-----------------------------------------------------------------------------------------+

I can see my CPU spiking and picking up all the load immediately.
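For reference, one way to confirm whether the GPU picks up any load during generation (a rough sketch; the model name is the one shown by ollama ps above) is to watch nvidia-smi in a second terminal while a prompt runs:

watch -n 1 nvidia-smi                                  # GPU-Util should climb above 0% during generation
ollama run deepseek-r1:8b-0528-qwen3-fp16 "hello"      # run this in another terminal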

Maybe I'm missing some fundamentals here, but ollama seems to detect my GPU just fine. From the ollama logs:

compute" id=GPU-b9efdd27-2bc6-f808-9156-b174c62c9ad9 library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="28.4 GiB"

Some information about my Linux installation:

  • EndeavourOS on kernel 6.14.9
  • NVIDIA driver version 570.153.02-3

I hope I'm just being stupid, but nothing in the logs caught my eye.

Thanks!

Relevant log output

Output of ollama:

time=2025-06-05T17:24:21.775+02:00 level=INFO source=routes.go:1234 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:26843545600 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/abuka/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-06-05T17:24:21.776+02:00 level=INFO source=images.go:479 msg="total blobs: 17"
time=2025-06-05T17:24:21.776+02:00 level=INFO source=images.go:486 msg="total unused blobs removed: 0"
time=2025-06-05T17:24:21.776+02:00 level=INFO source=routes.go:1287 msg="Listening on 127.0.0.1:11434 (version 0.9.0)"
time=2025-06-05T17:24:21.776+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-06-05T17:24:22.012+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-b9efdd27-2bc6-f808-9156-b174c62c9ad9 library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="28.4 GiB"
[GIN] 2025/06/05 - 17:24:29 | 200 |      30.911µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/05 - 17:24:29 | 200 |   23.053283ms |       127.0.0.1 | POST     "/api/show"
time=2025-06-05T17:24:30.006+02:00 level=INFO source=server.go:135 msg="system memory" total="62.7 GiB" free="55.7 GiB" free_swap="64.0 GiB"
time=2025-06-05T17:24:30.006+02:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=5 layers.split="" memory.available="[28.4 GiB]" memory.gpu_overhead="25.0 GiB" memory.required.full="15.9 GiB" memory.required.partial="3.1 GiB" memory.required.kv="576.0 MiB" memory.required.allocations="[3.1 GiB]" memory.weights.total="14.1 GiB" memory.weights.repeating="12.9 GiB" memory.weights.nonrepeating="1.2 GiB" memory.graph.full="384.0 MiB" memory.graph.partial="384.0 MiB"
llama_model_loader: loaded meta data with 32 key-value pairs and 399 tensors from /home/abuka/.ollama/models/blobs/sha256-e3c5da368c781ba65788b7e5a888cb53e77e2c37b69059bd6316e7eb101164a9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 0528 Qwen3 8B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-0528-Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                            general.license str              = mit
llama_model_loader: - kv   6:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   7:                       qwen3.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen3.embedding_length u32              = 4096
llama_model_loader: - kv   9:                  qwen3.feed_forward_length u32              = 12288
llama_model_loader: - kv  10:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  11:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  15:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  16:                          general.file_type u32              = 1
llama_model_loader: - kv  17:                    qwen3.rope.scaling.type str              = yarn
llama_model_loader: - kv  18:                  qwen3.rope.scaling.factor f32              = 4.000000
llama_model_loader: - kv  19: qwen3.rope.scaling.original_context_length u32              = 32768
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151645
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type  f16:  254 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 15.26 GiB (16.00 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 28
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 8.19 B
print_info: general.name     = DeepSeek R1 0528 Qwen3 8B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151645 '<|end▁of▁sentence|>'
print_info: EOT token        = 151645 '<|end▁of▁sentence|>'
print_info: PAD token        = 151645 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151645 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-06-05T17:24:30.138+02:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --model /home/abuka/.ollama/models/blobs/sha256-e3c5da368c781ba65788b7e5a888cb53e77e2c37b69059bd6316e7eb101164a9 --ctx-size 4096 --batch-size 512 --n-gpu-layers 5 --threads 12 --parallel 1 --port 45743"
time=2025-06-05T17:24:30.138+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-05T17:24:30.138+02:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-05T17:24:30.139+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-05T17:24:30.146+02:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-06-05T17:24:30.149+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-06-05T17:24:30.150+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:45743"
llama_model_loader: loaded meta data with 32 key-value pairs and 399 tensors from /home/abuka/.ollama/models/blobs/sha256-e3c5da368c781ba65788b7e5a888cb53e77e2c37b69059bd6316e7eb101164a9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 0528 Qwen3 8B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-0528-Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                            general.license str              = mit
llama_model_loader: - kv   6:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   7:                       qwen3.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen3.embedding_length u32              = 4096
llama_model_loader: - kv   9:                  qwen3.feed_forward_length u32              = 12288
llama_model_loader: - kv  10:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  11:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  12:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  15:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  16:                          general.file_type u32              = 1
llama_model_loader: - kv  17:                    qwen3.rope.scaling.type str              = yarn
llama_model_loader: - kv  18:                  qwen3.rope.scaling.factor f32              = 4.000000
llama_model_loader: - kv  19: qwen3.rope.scaling.original_context_length u32              = 32768
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151645
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type  f16:  254 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 15.26 GiB (16.00 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 28
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 36
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.25
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.19 B
print_info: general.name     = DeepSeek R1 0528 Qwen3 8B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151645 '<|end▁of▁sentence|>'
print_info: EOT token        = 151645 '<|end▁of▁sentence|>'
print_info: PAD token        = 151645 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151645 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
time=2025-06-05T17:24:30.389+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
load_tensors:   CPU_Mapped model buffer size = 15623.18 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.25
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.60 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1, padding = 32
llama_kv_cache_unified:        CPU KV buffer size =   576.00 MiB
llama_kv_cache_unified: KV self size  =  576.00 MiB, K (f16):  288.00 MiB, V (f16):  288.00 MiB
llama_context:        CPU compute buffer size =   304.75 MiB
llama_context: graph nodes  = 1374
llama_context: graph splits = 1
time=2025-06-05T17:24:31.393+02:00 level=INFO source=server.go:630 msg="llama runner started in 1.25 seconds"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.9.0

GiteaMirror added the bug label 2026-04-29 04:40:43 -05:00
Author
Owner

@buekera commented on GitHub (Jun 5, 2025):

I forgot to mention that I tried different LLM models of different sizes in case the model was too big for my GPU. Even models as small as qwen3:1.7b were not offloaded to my GPU.

Author
Owner

@rick-github commented on GitHub (Jun 5, 2025):

time=2025-06-05T17:24:30.146+02:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-06-05T17:24:30.149+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)

No GPU-enabled backends found. How did you install ollama?
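A quick way to see which compute backends an install actually shipped is to list the libraries next to the CPU backend from the log above (a sketch, assuming the standard /usr/lib/ollama layout; exact filenames vary by package and version):

ls /usr/lib/ollama/
# A CPU-only install shows only the libggml-cpu-*.so variants (plus libggml-base.so);
# a CUDA-enabled install would typically also ship a libggml-cuda*.so backend.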

Author
Owner

@buekera commented on GitHub (Jun 5, 2025):

I installed it via the official Arch package manager (pacman).
If this turns out to be the issue, I will shoot myself :D

Author
Owner

@rick-github commented on GitHub (Jun 5, 2025):

With Arch you also need to install the GPU backend, ollama-cuda in this case.
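On an Arch-based system (EndeavourOS here) that would look roughly like this, assuming ollama runs as the packaged systemd service:

sudo pacman -S ollama-cuda                  # install the CUDA backend alongside the ollama package
sudo systemctl restart ollama               # restart so the runner picks up the new backend
journalctl -u ollama | grep load_backend    # should now report a CUDA backend in addition to the CPU one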

Author
Owner

@buekera commented on GitHub (Jun 5, 2025):

Oh no, you are so right. I just did and it is working flawlessly.
I don't know why I didn't see that earlier myself. I swear I'm not drinking!

Thanks so much!

Reference: github-starred/ollama#53754