[GH-ISSUE #10833] Inference running on CPU instead of GPU #69173

Closed
opened 2026-05-04 17:23:41 -05:00 by GiteaMirror · 29 comments
Owner

Originally created by @mordesku on GitHub (May 23, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10833

What is the issue?

As in the title, Ollama doesn't run inference on the GPU, despite detecting it.

Relevant log output

time=2025-05-23T16:23:12.941+02:00 level=INFO source=routes.go:1205 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\mordesku\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-05-23T16:23:12.943+02:00 level=INFO source=images.go:463 msg="total blobs: 5"
time=2025-05-23T16:23:12.944+02:00 level=INFO source=images.go:470 msg="total unused blobs removed: 0"
time=2025-05-23T16:23:12.944+02:00 level=INFO source=routes.go:1258 msg="Listening on [::]:11434 (version 0.7.0)"
time=2025-05-23T16:23:12.944+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-23T16:23:12.944+02:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-05-23T16:23:12.944+02:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=16 efficiency=0 threads=32
time=2025-05-23T16:23:13.068+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-abdb4182-3306-85e9-d83c-56d59982821f library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
[GIN] 2025/05/23 - 16:24:18 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/05/23 - 16:24:18 | 200 |     62.7206ms |       127.0.0.1 | POST     "/api/show"
time=2025-05-23T16:24:19.051+02:00 level=INFO source=sched.go:777 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\mordesku\.ollama\models\blobs\sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 gpu=GPU-abdb4182-3306-85e9-d83c-56d59982821f parallel=2 available=23405342720 required="6.0 GiB"
time=2025-05-23T16:24:19.069+02:00 level=INFO source=server.go:135 msg="system memory" total="79.9 GiB" free="65.8 GiB" free_swap="68.4 GiB"
time=2025-05-23T16:24:19.069+02:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=35 layers.offload=35 layers.split="" memory.available="[21.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.0 GiB" memory.required.partial="6.0 GiB" memory.required.kv="450.0 MiB" memory.required.allocations="[6.0 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="1.8 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="517.0 MiB" memory.graph.partial="1.0 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-05-23T16:24:19.135+02:00 level=INFO source=server.go:431 msg="starting llama server" cmd="C:\\Users\\mordesku\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\mordesku\\.ollama\\models\\blobs\\sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 8192 --batch-size 512 --n-gpu-layers 35 --threads 16 --no-mmap --parallel 2 --port 53086"
time=2025-05-23T16:24:19.138+02:00 level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-05-23T16:24:19.138+02:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-23T16:24:19.138+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
time=2025-05-23T16:24:19.167+02:00 level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-05-23T16:24:19.168+02:00 level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:53086"
time=2025-05-23T16:24:19.217+02:00 level=INFO source=ggml.go:73 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=36
time=2025-05-23T16:24:19.233+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(clang)
time=2025-05-23T16:24:19.236+02:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="3.6 GiB"
time=2025-05-23T16:24:19.391+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-23T16:24:19.814+02:00 level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="137.0 MiB"
time=2025-05-23T16:24:19.891+02:00 level=INFO source=server.go:630 msg="llama runner started in 0.75 seconds"
[GIN] 2025/05/23 - 16:24:19 | 200 |    949.7807ms |       127.0.0.1 | POST     "/api/generate"

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.7.0

GiteaMirror added the bug label 2026-05-04 17:23:41 -05:00

@rick-github commented on GitHub (May 23, 2025):

time=2025-05-23T16:24:19.233+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(clang)

No CPU or GPU backends loaded. How did you install ollama?
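For comparison, a healthy runner logs its backend loads just before that system line; stubkan's Linux log later in this thread shows what it should look like:

load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so

In the log above, the system line reports only CPU.0.LLAMAFILE=1 and no CUDA entries, which is what prompted the question.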


@mordesku commented on GitHub (May 23, 2025):

I'm using the regular installer. However, I was previously testing Ollama on ipex while playing around with an Intel B580. I deleted a bunch of Ollama-related links from C:\Windows and my home directory.


@rick-github commented on GitHub (May 23, 2025):

Re-install ollama.


@mordesku commented on GitHub (May 23, 2025):

Did that at least 3 times—the same result. Is there any information on how to manually clean up everything?


@sempervictus commented on GitHub (May 23, 2025):

Seeing the same behavior loading command-a on a host with 4x 32G SXM cards. At least in cases where the layers need to be distributed between GPUs, it appears to select contiguous CPU memory instead of inter-GPU access over the (300G) bus.


@rick-github commented on GitHub (May 23, 2025):

I'm not a Windows user but recursively deleting everything under C:\Users\mordesku\AppData\Local\Programs\Ollama should do it. Note this will also delete any downloaded models. You may also have to check your environment variables (particularly PATH) and remove any references to ollama and ipex.
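A minimal sketch of that cleanup in Python, assuming the default per-user install location shown in the log above; the script and paths are illustrative, not an official uninstall procedure:

```python
import os
import shutil

# Default per-user install dir, as seen in the "starting llama server" log line.
ollama_dir = os.path.expandvars(r"%LOCALAPPDATA%\Programs\Ollama")
if os.path.isdir(ollama_dir):
    # Removes the installed binaries and bundled runner libraries.
    shutil.rmtree(ollama_dir)

# Report PATH entries that still reference ollama or ipex so they can be
# removed by hand in the Windows environment-variables dialog.
for entry in os.environ.get("PATH", "").split(os.pathsep):
    if "ollama" in entry.lower() or "ipex" in entry.lower():
        print("check PATH entry:", entry)
```

Editing PATH itself still has to happen in the system settings; the loop above only reports suspect entries.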


@rick-github commented on GitHub (May 23, 2025):

@sempervictus If it's using GPU at any point, including other models, it's not the same problem. In which case, open a new issue.


@abes200 commented on GitHub (May 24, 2025):

I had this issue on 0.7.0. The issue was closed because they thought they had fixed it with 0.7.1; however, I am still getting it on 0.7.1. It does not use the GPU at all, whether I load the model into GPU memory or not, resulting in exceptionally long response times from some models, particularly Gemma3 for me.
To fix this, I re-installed 0.6.8 and everything works fine again.


@rick-github commented on GitHub (May 24, 2025):

Logs.


@abes200 commented on GitHub (May 24, 2025):

> Logs.

How do I stop Ollama from putting my PC user name and other identifiable details into the logs? It includes the full path of the ollama model file, which is OLLAMA_MODELS:C:\\Users\\[MYUSERNAME]\\.ollama\\models
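There's no obvious server option for that, but the log can be scrubbed before posting. A minimal sketch, assuming the log has been saved locally as server.log and that the username is the only identifier to hide:

```python
import getpass
import re

# Assumption: the server log was saved locally as "server.log".
with open("server.log", encoding="utf-8", errors="replace") as f:
    text = f.read()

# Replace the current user's name wherever it appears; the paths show it in
# both single- and double-backslash form, so a plain substring match covers both.
user = getpass.getuser()
redacted = re.sub(re.escape(user), "REDACTED", text, flags=re.IGNORECASE)

with open("server-redacted.log", "w", encoding="utf-8") as f:
    f.write(redacted)
```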


@stubkan commented on GitHub (May 24, 2025):

Have the same issue. Posting here, as this log may be relevant. I have a 4 GB model imported (qwen3-4b) and can see it taking up GPU VRAM... but inference just uses the CPU. I see all the core usages going up and down in the system monitor. Not good.

Checking ps shows this:

ollama ps
NAME               ID              SIZE     PROCESSOR          UNTIL
qwen3-4b:latest    fe7c4d51aadb    13 GB    59%/41% CPU/GPU

I'm not sure how a 4 GB model is taking up 13 GB of memory, but this may be why it's slow and using the CPU?

ollama list
NAME               ID              SIZE      MODIFIED
qwen3-4b:latest    fe7c4d51aadb    4.3 GB    16 minutes ago

Is this normal behaviour?

Logs:

May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  26:               general.quantization_version u32              = 2
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  27:                          general.file_type u32              = 7
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  28:                      quantize.imatrix.file str              = Qwen3-4B-GGUF/imatrix_unsloth.dat
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  29:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-4B.txt
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  30:             quantize.imatrix.entries_count i32              = 252
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  31:              quantize.imatrix.chunks_count i32              = 685
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - type  f32:  145 tensors
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - type q8_0:  253 tensors
May 24 22:58:22 pleiades ollama[1391]: print_info: file format = GGUF V3 (latest)
May 24 22:58:22 pleiades ollama[1391]: print_info: file type   = Q8_0
May 24 22:58:22 pleiades ollama[1391]: print_info: file size   = 3.98 GiB (8.50 BPW)
May 24 22:58:22 pleiades ollama[1391]: load: special tokens cache size = 26
May 24 22:58:22 pleiades ollama[1391]: load: token to piece cache size = 0.9311 MB
May 24 22:58:22 pleiades ollama[1391]: print_info: arch             = qwen3
May 24 22:58:22 pleiades ollama[1391]: print_info: vocab_only       = 1
May 24 22:58:22 pleiades ollama[1391]: print_info: model type       = ?B
May 24 22:58:22 pleiades ollama[1391]: print_info: model params     = 4.02 B
May 24 22:58:22 pleiades ollama[1391]: print_info: general.name     = Qwen3-4B
May 24 22:58:22 pleiades ollama[1391]: print_info: vocab type       = BPE
May 24 22:58:22 pleiades ollama[1391]: print_info: n_vocab          = 151936
May 24 22:58:22 pleiades ollama[1391]: print_info: n_merges         = 151387
May 24 22:58:22 pleiades ollama[1391]: print_info: BOS token        = 11 ','
May 24 22:58:22 pleiades ollama[1391]: print_info: EOS token        = 151645 '<|im_end|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: EOT token        = 151645 '<|im_end|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: PAD token        = 151654 '<|vision_pad|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: LF token         = 198 'Ċ'
May 24 22:58:22 pleiades ollama[1391]: print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: FIM MID token    = 151660 '<|fim_middle|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: FIM PAD token    = 151662 '<|fim_pad|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: FIM REP token    = 151663 '<|repo_name|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: FIM SEP token    = 151664 '<|file_sep|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: EOG token        = 151643 '<|endoftext|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: EOG token        = 151645 '<|im_end|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: EOG token        = 151662 '<|fim_pad|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: EOG token        = 151663 '<|repo_name|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: EOG token        = 151664 '<|file_sep|>'
May 24 22:58:22 pleiades ollama[1391]: print_info: max token length = 256
May 24 22:58:22 pleiades ollama[1391]: llama_model_load: vocab only - skipping tensors
May 24 22:58:22 pleiades ollama[1391]: time=2025-05-24T22:58:22.484+01:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-eed555233267a33c7e8ee31682762cc7751b3f6d224039086e0e846f05fffa5d --ctx-size 32768 --batch-size 512 --n-gpu-layers 5 --threads 4 --parallel 1 --port 35167"
May 24 22:58:22 pleiades ollama[1391]: time=2025-05-24T22:58:22.484+01:00 level=INFO source=sched.go:472 msg="loaded runners" count=1
May 24 22:58:22 pleiades ollama[1391]: time=2025-05-24T22:58:22.484+01:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
May 24 22:58:22 pleiades ollama[1391]: time=2025-05-24T22:58:22.484+01:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
May 24 22:58:22 pleiades ollama[1391]: time=2025-05-24T22:58:22.491+01:00 level=INFO source=runner.go:815 msg="starting go runner"
May 24 22:58:22 pleiades ollama[1391]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
May 24 22:58:22 pleiades ollama[1391]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
May 24 22:58:22 pleiades ollama[1391]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 24 22:58:22 pleiades ollama[1391]: ggml_cuda_init: found 1 CUDA devices:
May 24 22:58:22 pleiades ollama[1391]:   Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes
May 24 22:58:22 pleiades ollama[1391]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
May 24 22:58:22 pleiades ollama[1391]: time=2025-05-24T22:58:22.761+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 24 22:58:22 pleiades ollama[1391]: time=2025-05-24T22:58:22.762+01:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:35167"
May 24 22:58:22 pleiades ollama[1391]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 2060) - 5289 MiB free
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: loaded meta data with 32 key-value pairs and 398 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-eed555233267a33c7e8ee31682762cc7751b3f6d224039086e0e846f05fffa5d (version GGUF V3 (latest))
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   0:                       general.architecture str              = qwen3
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   1:                               general.type str              = model
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   2:                               general.name str              = Qwen3-4B
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   3:                           general.basename str              = Qwen3-4B
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   5:                         general.size_label str              = 4B
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   7:                          qwen3.block_count u32              = 36
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   8:                       qwen3.context_length u32              = 40960
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv   9:                     qwen3.embedding_length u32              = 2560
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  10:                  qwen3.feed_forward_length u32              = 9728
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  11:                 qwen3.attention.head_count u32              = 32
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  12:              qwen3.attention.head_count_kv u32              = 8
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  13:                       qwen3.rope.freq_base f32              = 1000000.000000
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  14:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  15:                 qwen3.attention.key_length u32              = 128
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  16:               qwen3.attention.value_length u32              = 128
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = qwen2
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 151645
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 151654
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = false
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  26:               general.quantization_version u32              = 2
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  27:                          general.file_type u32              = 7
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  28:                      quantize.imatrix.file str              = Qwen3-4B-GGUF/imatrix_unsloth.dat
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  29:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-4B.txt
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  30:             quantize.imatrix.entries_count i32              = 252
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - kv  31:              quantize.imatrix.chunks_count i32              = 685
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - type  f32:  145 tensors
May 24 22:58:22 pleiades ollama[1391]: llama_model_loader: - type q8_0:  253 tensors
May 24 22:58:22 pleiades ollama[1391]: print_info: file format = GGUF V3 (latest)
May 24 22:58:22 pleiades ollama[1391]: print_info: file type   = Q8_0
May 24 22:58:22 pleiades ollama[1391]: print_info: file size   = 3.98 GiB (8.50 BPW)
May 24 22:58:22 pleiades ollama[1391]: time=2025-05-24T22:58:22.986+01:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
May 24 22:58:23 pleiades ollama[1391]: load: special tokens cache size = 26
May 24 22:58:23 pleiades ollama[1391]: load: token to piece cache size = 0.9311 MB
May 24 22:58:23 pleiades ollama[1391]: print_info: arch             = qwen3
May 24 22:58:23 pleiades ollama[1391]: print_info: vocab_only       = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: n_ctx_train      = 40960
May 24 22:58:23 pleiades ollama[1391]: print_info: n_embd           = 2560
May 24 22:58:23 pleiades ollama[1391]: print_info: n_layer          = 36
May 24 22:58:23 pleiades ollama[1391]: print_info: n_head           = 32
May 24 22:58:23 pleiades ollama[1391]: print_info: n_head_kv        = 8
May 24 22:58:23 pleiades ollama[1391]: print_info: n_rot            = 128
May 24 22:58:23 pleiades ollama[1391]: print_info: n_swa            = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: n_swa_pattern    = 1
May 24 22:58:23 pleiades ollama[1391]: print_info: n_embd_head_k    = 128
May 24 22:58:23 pleiades ollama[1391]: print_info: n_embd_head_v    = 128
May 24 22:58:23 pleiades ollama[1391]: print_info: n_gqa            = 4
May 24 22:58:23 pleiades ollama[1391]: print_info: n_embd_k_gqa     = 1024
May 24 22:58:23 pleiades ollama[1391]: print_info: n_embd_v_gqa     = 1024
May 24 22:58:23 pleiades ollama[1391]: print_info: f_norm_eps       = 0.0e+00
May 24 22:58:23 pleiades ollama[1391]: print_info: f_norm_rms_eps   = 1.0e-06
May 24 22:58:23 pleiades ollama[1391]: print_info: f_clamp_kqv      = 0.0e+00
May 24 22:58:23 pleiades ollama[1391]: print_info: f_max_alibi_bias = 0.0e+00
May 24 22:58:23 pleiades ollama[1391]: print_info: f_logit_scale    = 0.0e+00
May 24 22:58:23 pleiades ollama[1391]: print_info: f_attn_scale     = 0.0e+00
May 24 22:58:23 pleiades ollama[1391]: print_info: n_ff             = 9728
May 24 22:58:23 pleiades ollama[1391]: print_info: n_expert         = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: n_expert_used    = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: causal attn      = 1
May 24 22:58:23 pleiades ollama[1391]: print_info: pooling type     = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: rope type        = 2
May 24 22:58:23 pleiades ollama[1391]: print_info: rope scaling     = linear
May 24 22:58:23 pleiades ollama[1391]: print_info: freq_base_train  = 1000000.0
May 24 22:58:23 pleiades ollama[1391]: print_info: freq_scale_train = 1
May 24 22:58:23 pleiades ollama[1391]: print_info: n_ctx_orig_yarn  = 40960
May 24 22:58:23 pleiades ollama[1391]: print_info: rope_finetuned   = unknown
May 24 22:58:23 pleiades ollama[1391]: print_info: ssm_d_conv       = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: ssm_d_inner      = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: ssm_d_state      = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: ssm_dt_rank      = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: ssm_dt_b_c_rms   = 0
May 24 22:58:23 pleiades ollama[1391]: print_info: model type       = 4B
May 24 22:58:23 pleiades ollama[1391]: print_info: model params     = 4.02 B
May 24 22:58:23 pleiades ollama[1391]: print_info: general.name     = Qwen3-4B
May 24 22:58:23 pleiades ollama[1391]: print_info: vocab type       = BPE
May 24 22:58:23 pleiades ollama[1391]: print_info: n_vocab          = 151936
May 24 22:58:23 pleiades ollama[1391]: print_info: n_merges         = 151387
May 24 22:58:23 pleiades ollama[1391]: print_info: BOS token        = 11 ','
May 24 22:58:23 pleiades ollama[1391]: print_info: EOS token        = 151645 '<|im_end|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: EOT token        = 151645 '<|im_end|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: PAD token        = 151654 '<|vision_pad|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: LF token         = 198 'Ċ'
May 24 22:58:23 pleiades ollama[1391]: print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: FIM MID token    = 151660 '<|fim_middle|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: FIM PAD token    = 151662 '<|fim_pad|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: FIM REP token    = 151663 '<|repo_name|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: FIM SEP token    = 151664 '<|file_sep|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: EOG token        = 151643 '<|endoftext|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: EOG token        = 151645 '<|im_end|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: EOG token        = 151662 '<|fim_pad|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: EOG token        = 151663 '<|repo_name|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: EOG token        = 151664 '<|file_sep|>'
May 24 22:58:23 pleiades ollama[1391]: print_info: max token length = 256
May 24 22:58:23 pleiades ollama[1391]: load_tensors: loading model tensors, this can take a while... (mmap = true)
May 24 22:58:23 pleiades ollama[1391]: load_tensors: offloading 5 repeating layers to GPU
May 24 22:58:23 pleiades ollama[1391]: load_tensors: offloaded 5/37 layers to GPU
May 24 22:58:23 pleiades ollama[1391]: load_tensors:        CUDA0 model buffer size =   511.43 MiB
May 24 22:58:23 pleiades ollama[1391]: load_tensors:   CPU_Mapped model buffer size =  3565.00 MiB
May 24 22:58:23 pleiades ollama[1391]: llama_context: constructing llama_context
May 24 22:58:23 pleiades ollama[1391]: llama_context: n_seq_max     = 1
May 24 22:58:23 pleiades ollama[1391]: llama_context: n_ctx         = 32768
May 24 22:58:23 pleiades ollama[1391]: llama_context: n_ctx_per_seq = 32768
May 24 22:58:23 pleiades ollama[1391]: llama_context: n_batch       = 512
May 24 22:58:23 pleiades ollama[1391]: llama_context: n_ubatch      = 512
May 24 22:58:23 pleiades ollama[1391]: llama_context: causal_attn   = 1
May 24 22:58:23 pleiades ollama[1391]: llama_context: flash_attn    = 0
May 24 22:58:23 pleiades ollama[1391]: llama_context: freq_base     = 1000000.0
May 24 22:58:23 pleiades ollama[1391]: llama_context: freq_scale    = 1
May 24 22:58:23 pleiades ollama[1391]: llama_context: n_ctx_per_seq (32768) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
May 24 22:58:23 pleiades ollama[1391]: llama_context:        CPU  output buffer size =     0.59 MiB
May 24 22:58:23 pleiades ollama[1391]: llama_kv_cache_unified: kv_size = 32768, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1, padding = 32
May 24 22:58:23 pleiades ollama[1391]: llama_kv_cache_unified:      CUDA0 KV buffer size =   640.00 MiB
May 24 22:58:24 pleiades ollama[1391]: llama_kv_cache_unified:        CPU KV buffer size =  3968.00 MiB
May 24 22:58:24 pleiades ollama[1391]: llama_kv_cache_unified: KV self size  = 4608.00 MiB, K (f16): 2304.00 MiB, V (f16): 2304.00 MiB
May 24 22:58:24 pleiades ollama[1391]: llama_context:      CUDA0 compute buffer size =  2322.00 MiB
May 24 22:58:24 pleiades ollama[1391]: llama_context:  CUDA_Host compute buffer size =    69.01 MiB
May 24 22:58:24 pleiades ollama[1391]: llama_context: graph nodes  = 1374
May 24 22:58:24 pleiades ollama[1391]: llama_context: graph splits = 407 (with bs=512), 65 (with bs=1)
May 24 22:58:24 pleiades ollama[1391]: time=2025-05-24T22:58:24.492+01:00 level=INFO source=server.go:630 msg="llama runner started in 2.01 seconds"
May 24 22:58:32 pleiades ollama[1391]: [GIN] 2025/05/24 - 22:58:32 | 200 | 10.480719552s |       127.0.0.1 | POST     "/api/chat"
May 24 23:01:55 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:01:55 | 200 |          1m8s |       127.0.0.1 | POST     "/api/chat"
May 24 23:05:23 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:05:23 | 200 |      15.649µs |       127.0.0.1 | HEAD     "/"
May 24 23:05:23 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:05:23 | 200 |   25.963275ms |       127.0.0.1 | POST     "/api/show"
May 24 23:05:23 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:05:23 | 200 |   14.118912ms |       127.0.0.1 | POST     "/api/generate"
May 24 23:05:26 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:05:26 | 200 |   67.204728ms |       127.0.0.1 | POST     "/api/show"
May 24 23:05:44 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:05:44 | 200 |  7.356454887s |       127.0.0.1 | POST     "/api/chat"
May 24 23:05:57 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:05:57 | 200 |   31.040694ms |       127.0.0.1 | POST     "/api/show"
May 24 23:08:47 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:08:47 | 200 |      36.266µs |       127.0.0.1 | GET      "/api/version"
May 24 23:08:54 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:08:54 | 200 |       22.67µs |       127.0.0.1 | HEAD     "/"
May 24 23:08:54 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:08:54 | 200 |    1.371102ms |       127.0.0.1 | GET      "/api/ps"
May 24 23:10:29 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:10:29 | 200 |      55.611µs |       127.0.0.1 | HEAD     "/"
May 24 23:10:29 pleiades ollama[1391]: [GIN] 2025/05/24 - 23:10:29 | 200 |    1.432656ms |       127.0.0.1 | GET      "/api/tags"



@rick-github commented on GitHub (May 24, 2025):

It's taking up a lot of VRAM because you have a context of 32768 tokens. In the bit of the log that you didn't include it will show the memory estimation, but the upshot is that ollama can only load 5 of the 37 layers of the model into VRAM. This means that 32 layers are loaded into system RAM where the CPU does the inference. Because the CPU is much slower than the GPU at doing the matrix operations required for inference, most of the time is spent waiting for the CPU to finish its calculations. This shows as high utilization for the CPU and low utilization for the GPU. This is normal behaviour.
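The context-size effect can be checked directly against the numbers in the log above. A quick back-of-the-envelope in Python, using the n_layer, kv_size, and per-layer K/V widths the runner printed:

```python
# Values from stubkan's log: n_layer = 36, kv_size = 32768,
# n_embd_k_gqa = n_embd_v_gqa = 1024, f16 KV cache = 2 bytes per element.
n_layer, n_ctx = 36, 32768
k_width = v_width = 1024
kv_bytes = n_layer * n_ctx * (k_width + v_width) * 2
print(kv_bytes / 2**20)  # 4608.0 -> matches "KV self size = 4608.00 MiB"
```

So the 32k context alone adds ~4.5 GiB of KV cache on top of the ~4 GiB of weights and ~2.3 GiB of compute buffers, which accounts for most of the 13 GB that ollama ps reports, and for why only 5 of 37 layers fit on the card.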


@alhadebe commented on GitHub (May 24, 2025):

@rick-github I have the same issue as @stubkan. I did not set the context, so it would be the default Ollama ctx.

mistral-small3.1 is using 25GB per `ollama ps` and offloading to CPU, but only using 14GB per nvidia-smi:

```
docker exec -it ollama ollama ps
NAME                       ID              SIZE     PROCESSOR          UNTIL
mistral-small3.1:latest    b9aaf0c2586a    25 GB    41%/59% CPU/GPU    59 minutes from now
```
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8             11W /  420W |   14054MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          100959      C   /usr/bin/ollama                       14044MiB |
+-----------------------------------------------------------------------------------------+
```

[_ollama_logs](https://github.com/user-attachments/files/20428352/_ollama_logs.3.txt)
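
For anyone reproducing this comparison, the two views can be captured side by side (a minimal sketch; the `--query-gpu` flags are standard nvidia-smi options):

```
# Scheduler's estimate vs. actual GPU residency:
docker exec -it ollama ollama ps
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```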


@rick-github commented on GitHub (May 24, 2025):

```
--flash-attn --kv-cache-type q8_0
```

Memory estimation is inaccurate when flash attention is enabled: [#6160](https://github.com/ollama/ollama/issues/6160). As there is VRAM available and only a small fraction of the model is offloaded to CPU, you can force the entire model into VRAM by overriding `num_gpu` as described [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650).
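
For reference, a minimal sketch of that override via the API. `num_gpu` is the standard option name; the model and layer count are illustrative (99 simply exceeds the model's layer count, so everything is offloaded):

```
curl http://localhost:11434/api/generate -d '{
  "model": "mistral-small3.1",
  "prompt": "hello",
  "options": { "num_gpu": 99 }
}'
```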


@abes200 commented on GitHub (May 24, 2025):

OK, for everyone who thinks this is normal behavior or just the model not loading into VRAM properly:
Here are the verbose and `ollama ps` outputs from running Gemma3 12B Q4 QAT on Ollama 0.6.8:
Verbose:
total duration: **2m8.0704078s**
load duration: 111.3335ms
prompt eval count: 1942 token(s)
prompt eval duration: 30.9231777s
prompt eval rate: **62.80 tokens/s** <----- **Notice this part in particular**
eval count: 237 token(s)
eval duration: 1m36.8859754s
eval rate: 2.45 tokens/s

PS: gemma3qatvis3:latest 533bc7e8cbec 12 GB 57%/43% CPU/GPU 24 minutes from now

These are the normal results I expect on my poor old PC.
And here are the verbose and PS outputs from running Gemma3 on Ollama 0.7.1, which behaves much like 0.7.0:
Verbose:
total duration: **8m4.3370989s**
load duration: 112.5559ms
prompt eval count: 1942 token(s)
prompt eval duration: 6m19.7563467s
prompt eval rate: **5.11 tokens/s** <----- **And notice this part in particular**
eval count: 253 token(s)
eval duration: 1m44.377226s
eval rate: 2.42 tokens/s

PS: gemma3qatvis3:latest 533bc7e8cbec 11 GB 53%/47% CPU/GPU 24 minutes from now

As you can see, on 0.7.1 it's actually reporting that slightly more of the model is loaded into VRAM, and the model takes up less RAM in total. Yet the prompt eval rate is literally more than 10x higher on 0.6.8.
Notes:

  1. Yes, this is a custom model I downloaded from Hugging Face.
  2. Yes, the results are the same even if loading a model pulled directly from Ollama site.
  3. Yes, re-installing 0.6.8 immediately fixed my problems.
  4. Yes, I am specifying num_gpu on both and both tests are using the exact same model with the exact same settings.
  5. No, I won't include an entire log. If you want a specific part, tell me and I'll remove all identifying details from it before sharing it.
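
For anyone trying to reproduce this regression comparison: the numbers above are what `ollama run --verbose` prints. A sketch of the procedure (the model tag is illustrative):

```
ollama --version                      # confirm which build is running
ollama run gemma3:12b-it-qat --verbose "paste the same ~1900-token prompt here"
ollama ps                             # note the CPU/GPU split while the model is loaded
```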

@alhadebe commented on GitHub (May 25, 2025):

> ```
> --flash-attn --kv-cache-type q8_0
> ```
>
> Memory estimation is inaccurate when flash attention is enabled: [#6160](https://github.com/ollama/ollama/issues/6160). As there is VRAM available and only a small fraction of the model is offloaded to CPU, you can force the entire model into VRAM by overriding `num_gpu` as described [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650).

@rick-github thanks for the links. I think there is still an issue, and it might be isolated to mistral-small3.1.
If I run without flash-attn or KV-cache quantization, the memory usage still doesn't look right. See `ollama ps` & logs.
ctx is set to 4096; it should not be using 26GB.

```
docker exec -it ollama ollama ps
NAME                       ID              SIZE     PROCESSOR          UNTIL
mistral-small3.1:latest    b9aaf0c2586a    26 GB    40%/60% CPU/GPU    59 minutes from now
```

[_ollama_logs (4).txt](https://github.com/user-attachments/files/20428959/_ollama_logs.4.txt)
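
One way to confirm what context the runner was actually started with, and what the scheduler estimated (a sketch; the grep patterns come from the log lines quoted elsewhere in this thread):

```
# In Docker, the runner command line and memory estimate appear in the container log:
docker logs ollama 2>&1 | grep -E "ctx-size|msg=offload" | tail -n 5
```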


@jessegross commented on GitHub (May 25, 2025):

@NotYourAverageAl Everything looks correct here. Here's the math:

`layers.requested=-1 layers.model=41 layers.offload=40`

It is offloading everything except the last layer, which is the one that contains the vision projector. This is the smallest step between full offloading and partial offloading. (i.e. the vision projector is the lowest priority layer to offload and it won't offload a partial vision projector).

```
time=2025-05-24T22:38:16.037Z level=INFO source=ggml.go:351 msg="model weights" buffer=CPU size="1.7 GiB"
time=2025-05-24T22:38:16.037Z level=INFO source=ggml.go:351 msg="model weights" buffer=CUDA0 size="12.7 GiB"
```

```
time=2025-05-24T22:38:16.344Z level=INFO source=ggml.go:638 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="167.7 MiB"
time=2025-05-24T22:38:16.344Z level=INFO source=ggml.go:638 msg="compute graph" backend=CPU buffer_type=CPU size="9.0 GiB"
```

Given the above, the vision projector is the part that is on the CPU. It is large: 1.7G weights + 9G graph = 10.7G.

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8             11W /  420W |   14054MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

nvidia-smi is showing 14G used on the GPU; this is only the offloaded part, not the entire model. Adding the part it couldn't offload gives 14G + 10.7G = 24.7G. This is larger than the VRAM on your GPU, which is why the vision projector got pushed to the CPU.

```
NAME                       ID              SIZE     PROCESSOR          UNTIL
mistral-small3.1:latest    b9aaf0c2586a    25 GB    41%/59% CPU/GPU    59 minutes from now
```

24.7G is also roughly the total size of the model reported by `ollama ps`, and 59% × 25G = 14.75G, roughly what is on the GPU.

Note that only the number in `ollama ps` is an estimate; everything else reflects the actual memory allocations. So we can see that the estimate matches reality.
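
As a quick sanity check of that arithmetic (numbers copied from the logs above):

```
awk 'BEGIN { gpu = 14.0; cpu = 1.7 + 9.0; total = gpu + cpu;
  printf "total = %.1f GiB, GPU share = %.0f%%\n", total, gpu/total*100 }'
# total = 24.7 GiB, GPU share = 57%
```

That 57% is close to the 59% that `ollama ps` reports, given rounding in both numbers.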


@alhadebe commented on GitHub (May 25, 2025):

@rick-github but the model size for mistral-small3.1:latest is 15GB, so it should all fit on the 3090. Am I missing something? [gemma3:27b](https://ollama.com/library/gemma3:27b) uses 17GB and fits on the GPU alone, so the smaller Mistral should not be using more memory, if I understand correctly.

![Image](https://github.com/user-attachments/assets/91a51fc1-6fc5-49e6-b154-32c10efe6824)

Unless somehow I got the q8 quant (26GB) instead of the q4 quant (15GB).


@jessegross commented on GitHub (May 25, 2025):

15G is the on-disk size; it also needs space in memory to do the computation. In the example above, where I said:
`Given the above, the vision projector is the part that is on the CPU. It is large: 1.7G weights + 9G graph = 10.7G.`

The weights (1.7G) are the part coming from disk. The graph (9G) is the computation buffer, so that is what dominates the memory usage, not the part coming from disk.

Different models have different architectures that result in very different computation buffer sizes. Mistral's is particularly large, Gemma3's is quite a bit smaller, and for text-only models it can be negligible. That last part is why people often use the on-disk size as an estimate of the memory requirements, but that is not accurate for many of the newer models.
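
If you want to see this for a given model, the weight and graph allocations are logged at load time. A sketch for a systemd install (Docker users would use `docker logs ollama` instead):

```
journalctl -u ollama | grep -E "model weights|compute graph"
```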


@alhadebe commented on GitHub (May 25, 2025):

That makes more sense. Thanks @jessegross


@ccebelenski commented on GitHub (May 25, 2025):

I am seeing a discrepancy, however. The model is devstral:latest, ~20GB on disk, split across 4 GPUs (4060 Ti 16GB each). Even with lots of space left in VRAM (even after context allocation) it is offloading from the GPUs to the CPU. Nothing really notable in the logs except this behavior, but I'm thinking the memory calculation is wacky. The GPUs have an average of 6GB free on each card, and `ollama ps` is showing a total model size of 68GB, 4GB more than the VRAM available, which explains why it thinks it has to offload to CPU. The only thing of note in my config is that I quantize the KV cache to q8_0. Max loaded models = 1. num_ctx = 65536, which should fit handily.
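
For context, a hypothetical systemd drop-in showing how such settings are typically applied (values illustrative; these variables all appear in the server-config dump at startup, so you can verify there that they took effect):

```
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```

Note that the quantized KV cache is only honored when flash attention is enabled, and the thread already notes (#6160) that the memory estimate is inaccurate in that configuration, which would fit an estimate that overshoots the available VRAM.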


@zora-wuw commented on GitHub (Aug 4, 2025):

I have the same issue as OP, on an AMD CPU and NVIDIA GPU, but on Linux. Installed via the manual install commands because it is running on an HPC system and I have no root access.
The log shows it discovering the GPU, but then it reports no compatible GPU library(?) and loads the CPU backend. Yet `ollama ps` shows the model as running on the GPU.

```
time=2025-08-04T16:20:24.583+10:00 level=INFO source=routes.go:1238 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL:0 HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY:cuda OLLAMA_LOAD_TIMEOUT:1h0m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/fred/oz334/IP_classifier/ollama_20250729/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:0 http_proxy: https_proxy: no_proxy:]"
time=2025-08-04T16:20:24.611+10:00 level=INFO source=images.go:476 msg="total blobs: 10"
time=2025-08-04T16:20:24.611+10:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-08-04T16:20:24.612+10:00 level=INFO source=routes.go:1291 msg="Listening on [::]:11434 (version 0.10.0)"
time=2025-08-04T16:20:24.612+10:00 level=DEBUG source=sched.go:106 msg="starting llm scheduler"
time=2025-08-04T16:20:24.612+10:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-04T16:20:24.615+10:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-08-04T16:20:24.615+10:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*
time=2025-08-04T16:20:24.615+10:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/fred/oz334/IP_classifier/ollama_20250729/lib/ollama/libcuda.so* /apps/modules/software/CUDA/12.8.0/lib64/libcuda.so* /apps/modules/software/CUDA/12.8.0/lib/libcuda.so* /var/tmp/jobfs/libcuda.so* /tmp/libcuda.so* /fred/oz334/IP_classifier/ollama_20250729/lib/ollama/libcuda.so* /apps/modules/software/CUDA/12.8.0/nvvm/lib64/libcuda.so* /apps/modules/software/CUDA/12.8.0/extras/CUPTI/lib64/libcuda.so* /apps/modules/software/CUDA/12.8.0/targets/x86_64-linux/lib/libcuda.so* /apps/slurm/latest/lib/libcuda.so* /apps/slurm/latest/lib/slurm/libcuda.so* /opt/nvidia/latest/usr/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
**time=2025-08-04T16:20:24.633+10:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[/opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08]**
initializing /opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08
dlsym: cuInit - 0x14c7abe4c060
dlsym: cuDriverGetVersion - 0x14c7abe4c080
dlsym: cuDeviceGetCount - 0x14c7abe4c0c0
dlsym: cuDeviceGet - 0x14c7abe4c0a0
dlsym: cuDeviceGetAttribute - 0x14c7abe4c1a0
dlsym: cuDeviceGetUuid - 0x14c7abe4c100
dlsym: cuDeviceGetName - 0x14c7abe4c0e0
dlsym: cuCtxCreate_v3 - 0x14c7abe4c380
dlsym: cuMemGetInfo_v2 - 0x14c7abe4cb00
dlsym: cuCtxDestroy - 0x14c7abeaaca0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-08-04T16:20:24.867+10:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=/opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08
[GPU-7c75217c-ed02-25a3-ffda-9815da4a0b71] CUDA totalMem 81153mb
[GPU-7c75217c-ed02-25a3-ffda-9815da4a0b71] CUDA freeMem 80729mb
[GPU-7c75217c-ed02-25a3-ffda-9815da4a0b71] Compute Capability 8.0
time=2025-08-04T16:20:25.144+10:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2025-08-04T16:20:25.144+10:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7c75217c-ed02-25a3-ffda-9815da4a0b71 library=cuda variant=v12 compute=8.0 driver=12.8 name="NVIDIA A100-SXM4-80GB" total="79.3 GiB" available="78.8 GiB"
[GIN] 2025/08/04 - 16:21:24 | 200 |     210.705µs |       127.0.0.1 | HEAD     "/"
time=2025-08-04T16:21:24.723+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
[GIN] 2025/08/04 - 16:21:24 | 200 |  187.709665ms |       127.0.0.1 | POST     "/api/show"
time=2025-08-04T16:21:24.760+10:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.5 GiB" before.free="463.4 GiB" before.free_swap="118.4 GiB" now.total="503.5 GiB" now.free="465.3 GiB" now.free_swap="118.4 GiB"
initializing /opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08
dlsym: cuInit - 0x14c7abe4c060
dlsym: cuDriverGetVersion - 0x14c7abe4c080
dlsym: cuDeviceGetCount - 0x14c7abe4c0c0
dlsym: cuDeviceGet - 0x14c7abe4c0a0
dlsym: cuDeviceGetAttribute - 0x14c7abe4c1a0
dlsym: cuDeviceGetUuid - 0x14c7abe4c100
dlsym: cuDeviceGetName - 0x14c7abe4c0e0
dlsym: cuCtxCreate_v3 - 0x14c7abe4c380
dlsym: cuMemGetInfo_v2 - 0x14c7abe4cb00
dlsym: cuCtxDestroy - 0x14c7abeaaca0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-08-04T16:21:25.027+10:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-7c75217c-ed02-25a3-ffda-9815da4a0b71 name="NVIDIA A100-SXM4-80GB" overhead="0 B" before.total="79.3 GiB" before.free="78.8 GiB" now.total="79.3 GiB" now.free="78.8 GiB" now.used="424.1 MiB"
releasing cuda driver library
time=2025-08-04T16:21:25.028+10:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-08-04T16:21:25.041+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-08-04T16:21:25.074+10:00 level=DEBUG source=sched.go:226 msg="loading first model" model=/fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312
time=2025-08-04T16:21:25.074+10:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[78.8 GiB]"
time=2025-08-04T16:21:25.074+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.vision.block_count default=0
time=2025-08-04T16:21:25.075+10:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=/fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 gpu=GPU-7c75217c-ed02-25a3-ffda-9815da4a0b71 parallel=1 available=84651212800 required="21.5 GiB"
time=2025-08-04T16:21:25.075+10:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.5 GiB" before.free="465.3 GiB" before.free_swap="118.4 GiB" now.total="503.5 GiB" now.free="465.3 GiB" now.free_swap="118.4 GiB"
initializing /opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08
dlsym: cuInit - 0x14c7abe4c060
dlsym: cuDriverGetVersion - 0x14c7abe4c080
dlsym: cuDeviceGetCount - 0x14c7abe4c0c0
dlsym: cuDeviceGet - 0x14c7abe4c0a0
dlsym: cuDeviceGetAttribute - 0x14c7abe4c1a0
dlsym: cuDeviceGetUuid - 0x14c7abe4c100
dlsym: cuDeviceGetName - 0x14c7abe4c0e0
dlsym: cuCtxCreate_v3 - 0x14c7abe4c380
dlsym: cuMemGetInfo_v2 - 0x14c7abe4cb00
dlsym: cuCtxDestroy - 0x14c7abeaaca0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-08-04T16:21:25.335+10:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-7c75217c-ed02-25a3-ffda-9815da4a0b71 name="NVIDIA A100-SXM4-80GB" overhead="0 B" before.total="79.3 GiB" before.free="78.8 GiB" now.total="79.3 GiB" now.free="78.8 GiB" now.used="424.1 MiB"
releasing cuda driver library
time=2025-08-04T16:21:25.335+10:00 level=INFO source=server.go:135 msg="system memory" total="503.5 GiB" free="465.3 GiB" free_swap="118.4 GiB"
time=2025-08-04T16:21:25.335+10:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[78.8 GiB]"
time=2025-08-04T16:21:25.335+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.vision.block_count default=0
time=2025-08-04T16:21:25.336+10:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[78.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB"
**time=2025-08-04T16:21:25.336+10:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]**
time=2025-08-04T16:21:25.360+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-08-04T16:21:25.361+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-08-04T16:21:25.361+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-08-04T16:21:25.361+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.rope.freq_scale default=1
time=2025-08-04T16:21:25.361+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-08-04T16:21:25.361+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-08-04T16:21:25.361+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-08-04T16:21:25.366+10:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/fred/oz334/IP_classifier/ollama_20250729/bin/ollama runner --ollama-engine --model /fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 --ctx-size 4096 --batch-size 512 --n-gpu-layers 65 --threads 64 --parallel 1 --port 45673"
time=2025-08-04T16:21:25.366+10:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_MODELS=/fred/oz334/IP_classifier/ollama_20250729/models ROCR_VISIBLE_DEVICES=0 OLLAMA_LIBRARY_PATH=/var/tmp/jobfs:/tmp:/fred/oz334/IP_classifier/ollama_20250729/lib/ollama CUDA_VISIBLE_DEVICES=GPU-7c75217c-ed02-25a3-ffda-9815da4a0b71 OLLAMA_HOST=0.0.0.0:11434 CUDA_PATH=/apps/modules/software/CUDA/12.8.0 CUDA_ROOT=/apps/modules/software/CUDA/12.8.0 GPU_DEVICE_ORDINAL=0 OLLAMA_LLM_LIBRARY=cuda OLLAMA_LOAD_TIMEOUT=60m OLLAMA_DEBUG=1 LD_LIBRARY_PATH=/fred/oz334/IP_classifier/ollama_20250729/lib/ollama:/apps/modules/software/CUDA/12.8.0/lib64:/apps/modules/software/CUDA/12.8.0/lib:/var/tmp/jobfs:/tmp:/fred/oz334/IP_classifier/ollama_20250729/lib/ollama:/apps/modules/software/CUDA/12.8.0/nvvm/lib64:/apps/modules/software/CUDA/12.8.0/extras/CUPTI/lib64:/apps/modules/software/CUDA/12.8.0/targets/x86_64-linux/lib:/apps/slurm/latest/lib:/apps/slurm/latest/lib/slurm:/opt/nvidia/latest/usr/lib64:/fred/oz334/IP_classifier/ollama_20250729/lib/ollama OLLAMA_NEW_ENGINE=true CUDA_HOME=/apps/modules/software/CUDA/12.8.0 PATH=/apps/system/software/apptainer/latest/bin:/apps/modules/software/CUDA/12.8.0/nvvm/bin:/apps/modules/software/CUDA/12.8.0/bin:/fred/oz334/IP_classifier/ollama_20250729/bin:/home/zwu/.nvm/versions/node/v22.17.1/bin:/apps/slurm/latest/sbin:/apps/slurm/latest/bin:/opt/nvidia/latest/usr/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/fred/oz334/ollama OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/fred/oz334/IP_classifier/ollama_20250729/lib/ollama
time=2025-08-04T16:21:25.366+10:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-04T16:21:25.366+10:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-04T16:21:25.380+10:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-04T16:21:25.381+10:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-04T16:21:25.382+10:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:45673"
time=2025-08-04T16:21:25.408+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-08-04T16:21:25.410+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.description default=""
time=2025-08-04T16:21:25.410+10:00 level=INFO source=ggml.go:92 msg="" architecture=qwen3 file_type=Q4_K_M name="Qwen3 32B" description="" num_tensors=707 num_key_values=28
time=2025-08-04T16:21:25.410+10:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/fred/oz334/IP_classifier/ollama_20250729/lib/ollama
load_backend: loaded CPU backend from /fred/oz334/IP_classifier/ollama_20250729/lib/ollama/libggml-cpu-haswell.so
time=2025-08-04T16:21:25.489+10:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-08-04T16:21:25.492+10:00 level=INFO source=ggml.go:365 msg="offloading 0 repeating layers to GPU"
time=2025-08-04T16:21:25.492+10:00 level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-04T16:21:25.492+10:00 level=INFO source=ggml.go:376 msg="offloaded 0/65 layers to GPU"
time=2025-08-04T16:21:25.492+10:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="18.8 GiB"
time=2025-08-04T16:21:25.492+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-08-04T16:21:25.492+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-08-04T16:21:25.492+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.rope.freq_scale default=1
time=2025-08-04T16:21:25.492+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.expert_count default=0
time=2025-08-04T16:21:25.492+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.expert_used_count default=0
time=2025-08-04T16:21:25.492+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.norm_top_k_prob default=true
time=2025-08-04T16:21:25.631+10:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-04T16:21:25.667+10:00 level=DEBUG source=ggml.go:650 msg="compute graph" nodes=2502 splits=1
time=2025-08-04T16:21:25.667+10:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="564.0 MiB"
time=2025-08-04T16:21:25.668+10:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=437575680A allocated.CPU.Weights="[315638784A 315638784A 315638784A 315638784A 315638784A 315638784A 315638784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 281846784A 281846784A 315638784A 315638784A 315638784A 315638784A 315638784A 315638784A 315638784A 315638784A 315638784A 638151680A]" allocated.CPU.Cache="[16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 16777216A 0U]" allocated.CPU.Graph=591396864A
time=2025-08-04T16:21:25.884+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.01"
time=2025-08-04T16:21:26.137+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.04"
time=2025-08-04T16:21:26.387+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.06"
time=2025-08-04T16:21:26.638+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.07"
time=2025-08-04T16:21:26.888+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.09"
time=2025-08-04T16:21:27.139+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.10"
time=2025-08-04T16:21:27.389+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.12"
time=2025-08-04T16:21:27.640+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.13"
time=2025-08-04T16:21:27.893+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.15"
time=2025-08-04T16:21:28.143+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.16"
time=2025-08-04T16:21:28.394+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.17"
time=2025-08-04T16:21:28.645+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.19"
time=2025-08-04T16:21:28.895+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.20"
time=2025-08-04T16:21:29.146+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.22"
time=2025-08-04T16:21:29.397+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.23"
time=2025-08-04T16:21:29.653+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.24"
time=2025-08-04T16:21:29.904+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.26"
time=2025-08-04T16:21:30.154+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.27"
time=2025-08-04T16:21:30.407+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.28"
time=2025-08-04T16:21:30.657+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.29"
time=2025-08-04T16:21:30.908+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.31"
time=2025-08-04T16:21:31.159+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.32"
time=2025-08-04T16:21:31.410+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.35"
time=2025-08-04T16:21:31.660+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.37"
time=2025-08-04T16:21:31.911+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.39"
time=2025-08-04T16:21:32.161+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.42"
time=2025-08-04T16:21:32.412+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.45"
time=2025-08-04T16:21:32.662+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.47"
time=2025-08-04T16:21:32.913+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.49"
time=2025-08-04T16:21:33.163+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.50"
time=2025-08-04T16:21:33.414+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.51"
time=2025-08-04T16:21:33.665+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.52"
time=2025-08-04T16:21:33.915+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.53"
time=2025-08-04T16:21:34.166+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.54"
time=2025-08-04T16:21:34.417+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.55"
time=2025-08-04T16:21:34.668+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.56"
time=2025-08-04T16:21:34.919+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.57"
time=2025-08-04T16:21:35.169+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.58"
time=2025-08-04T16:21:35.423+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.59"
time=2025-08-04T16:21:35.674+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.60"
time=2025-08-04T16:21:35.924+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.61"
time=2025-08-04T16:21:36.175+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.62"
time=2025-08-04T16:21:36.426+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.63"
time=2025-08-04T16:21:36.676+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.64"
time=2025-08-04T16:21:36.931+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.65"
time=2025-08-04T16:21:37.182+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.66"
time=2025-08-04T16:21:37.439+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.67"
time=2025-08-04T16:21:37.689+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.69"
time=2025-08-04T16:21:37.940+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.69"
time=2025-08-04T16:21:38.196+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.71"
time=2025-08-04T16:21:38.446+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.72"
time=2025-08-04T16:21:38.697+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.73"
time=2025-08-04T16:21:38.947+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.74"
time=2025-08-04T16:21:39.202+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.75"
time=2025-08-04T16:21:39.453+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.76"
time=2025-08-04T16:21:39.703+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.77"
time=2025-08-04T16:21:39.956+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.78"
time=2025-08-04T16:21:40.207+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.79"
time=2025-08-04T16:21:40.458+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.80"
time=2025-08-04T16:21:40.708+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.81"
time=2025-08-04T16:21:40.959+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.82"
time=2025-08-04T16:21:41.209+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.83"
time=2025-08-04T16:21:41.460+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.83"
time=2025-08-04T16:21:41.712+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.85"
time=2025-08-04T16:21:41.963+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.85"
time=2025-08-04T16:21:42.213+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.86"
time=2025-08-04T16:21:42.464+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.87"
time=2025-08-04T16:21:42.714+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.88"
time=2025-08-04T16:21:42.965+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.89"
time=2025-08-04T16:21:43.216+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.90"
time=2025-08-04T16:21:43.466+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.91"
time=2025-08-04T16:21:43.717+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.92"
time=2025-08-04T16:21:43.967+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.92"
time=2025-08-04T16:21:44.218+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.94"
time=2025-08-04T16:21:44.469+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.95"
time=2025-08-04T16:21:44.719+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.96"
time=2025-08-04T16:21:44.970+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.97"
time=2025-08-04T16:21:45.220+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.97"
time=2025-08-04T16:21:45.481+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.97"
time=2025-08-04T16:21:45.750+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.97"
time=2025-08-04T16:21:46.004+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.98"
time=2025-08-04T16:21:46.257+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.99"
time=2025-08-04T16:21:46.508+10:00 level=INFO source=server.go:637 msg="llama runner started in 21.14 seconds"
time=2025-08-04T16:21:46.509+10:00 level=DEBUG source=sched.go:493 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=1 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=1 runner.pid=810687 runner.model=/fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=4096
time=2025-08-04T16:21:46.519+10:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=58 format=""
time=2025-08-04T16:21:46.573+10:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=11 used=0 remaining=11
```

Also, `ollama ps` somehow shows it running 100% on GPU:

```
NAME         ID              SIZE     PROCESSOR    CONTEXT    UNTIL
qwen3:32b    030ee887880f    23 GB    100% GPU     4096       4 minutes from now
```
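
Given the `compatible gpu libraries compatible=[]` line, one thing worth checking on a manual install is whether the CUDA backend libraries were actually unpacked next to the CPU ones (a sketch; the exact directory layout varies by release):

```
ls /fred/oz334/IP_classifier/ollama_20250729/lib/ollama/
# If only libggml-cpu-*.so files are present (no CUDA backend libraries),
# the runner can only load the CPU backend, matching "offloaded 0/65 layers to GPU".
```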
time=2025-08-04T16:21:27.389+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.12" time=2025-08-04T16:21:27.640+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.13" time=2025-08-04T16:21:27.893+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.15" time=2025-08-04T16:21:28.143+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.16" time=2025-08-04T16:21:28.394+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.17" time=2025-08-04T16:21:28.645+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.19" time=2025-08-04T16:21:28.895+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.20" time=2025-08-04T16:21:29.146+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.22" time=2025-08-04T16:21:29.397+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.23" time=2025-08-04T16:21:29.653+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.24" time=2025-08-04T16:21:29.904+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.26" time=2025-08-04T16:21:30.154+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.27" time=2025-08-04T16:21:30.407+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.28" time=2025-08-04T16:21:30.657+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.29" time=2025-08-04T16:21:30.908+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.31" time=2025-08-04T16:21:31.159+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.32" time=2025-08-04T16:21:31.410+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.35" time=2025-08-04T16:21:31.660+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.37" time=2025-08-04T16:21:31.911+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.39" time=2025-08-04T16:21:32.161+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.42" time=2025-08-04T16:21:32.412+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.45" time=2025-08-04T16:21:32.662+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.47" time=2025-08-04T16:21:32.913+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.49" time=2025-08-04T16:21:33.163+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.50" time=2025-08-04T16:21:33.414+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.51" time=2025-08-04T16:21:33.665+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.52" time=2025-08-04T16:21:33.915+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.53" time=2025-08-04T16:21:34.166+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.54" time=2025-08-04T16:21:34.417+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.55" time=2025-08-04T16:21:34.668+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.56" time=2025-08-04T16:21:34.919+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.57" time=2025-08-04T16:21:35.169+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.58" time=2025-08-04T16:21:35.423+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.59" time=2025-08-04T16:21:35.674+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.60" time=2025-08-04T16:21:35.924+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.61" time=2025-08-04T16:21:36.175+10:00 level=DEBUG source=server.go:643 msg="model load 
progress 0.62" time=2025-08-04T16:21:36.426+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.63" time=2025-08-04T16:21:36.676+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.64" time=2025-08-04T16:21:36.931+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.65" time=2025-08-04T16:21:37.182+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.66" time=2025-08-04T16:21:37.439+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.67" time=2025-08-04T16:21:37.689+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.69" time=2025-08-04T16:21:37.940+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.69" time=2025-08-04T16:21:38.196+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.71" time=2025-08-04T16:21:38.446+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.72" time=2025-08-04T16:21:38.697+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.73" time=2025-08-04T16:21:38.947+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.74" time=2025-08-04T16:21:39.202+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.75" time=2025-08-04T16:21:39.453+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.76" time=2025-08-04T16:21:39.703+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.77" time=2025-08-04T16:21:39.956+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.78" time=2025-08-04T16:21:40.207+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.79" time=2025-08-04T16:21:40.458+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.80" time=2025-08-04T16:21:40.708+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.81" time=2025-08-04T16:21:40.959+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.82" time=2025-08-04T16:21:41.209+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.83" time=2025-08-04T16:21:41.460+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.83" time=2025-08-04T16:21:41.712+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.85" time=2025-08-04T16:21:41.963+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.85" time=2025-08-04T16:21:42.213+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.86" time=2025-08-04T16:21:42.464+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.87" time=2025-08-04T16:21:42.714+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.88" time=2025-08-04T16:21:42.965+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.89" time=2025-08-04T16:21:43.216+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.90" time=2025-08-04T16:21:43.466+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.91" time=2025-08-04T16:21:43.717+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.92" time=2025-08-04T16:21:43.967+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.92" time=2025-08-04T16:21:44.218+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.94" time=2025-08-04T16:21:44.469+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.95" time=2025-08-04T16:21:44.719+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.96" time=2025-08-04T16:21:44.970+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.97" time=2025-08-04T16:21:45.220+10:00 level=DEBUG source=server.go:643 
msg="model load progress 0.97" time=2025-08-04T16:21:45.481+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.97" time=2025-08-04T16:21:45.750+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.97" time=2025-08-04T16:21:46.004+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.98" time=2025-08-04T16:21:46.257+10:00 level=DEBUG source=server.go:643 msg="model load progress 0.99" time=2025-08-04T16:21:46.508+10:00 level=INFO source=server.go:637 msg="llama runner started in 21.14 seconds" time=2025-08-04T16:21:46.509+10:00 level=DEBUG source=sched.go:493 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=1 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=1 runner.pid=810687 runner.model=/fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=4096 time=2025-08-04T16:21:46.519+10:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=58 format="" time=2025-08-04T16:21:46.573+10:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=11 used=0 remaining=11 ``` Also `ollama ps` shows using GPU somehow ``` ⠇ NAME ID SIZE PROCESSOR CONTEXT UNTIL qwen3:32b 030ee887880f 23 GB 100% GPU 4096 4 minutes from now ```
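Note the contradiction here: the runner logged `offloaded 0/65 layers to GPU`, while `ollama ps` reports `100% GPU` (the scheduler's placement estimate). One way to check where the weights actually landed (a suggestion added for clarity, not from the original post; assumes standard NVIDIA tooling on the node) is to query the GPU directly while the model is loaded:

```
# If the ~19 GiB of weights were really on the GPU, memory.used would reflect it
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
```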

@rick-github commented on GitHub (Aug 4, 2025):

@ccebelenski

Memory estimation is inaccurate when flash attention is enabled. #6160

@zora-wuw

```
time=2025-08-04T16:21:25.410+10:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/fred/oz334/IP_classifier/ollama_20250729/lib/ollama
load_backend: loaded CPU backend from /fred/oz334/IP_classifier/ollama_20250729/lib/ollama/libggml-cpu-haswell.so
time=2025-08-04T16:21:25.489+10:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
```

No GPU backends found. How did you install ollama?
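
For reference, the two usual Linux install paths (a sketch added for context; the placeholder install directory is hypothetical):

```
# Official install script (requires root):
curl -fsSL https://ollama.com/install.sh | sh

# Manual tarball install (no root needed):
curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
tar -C /path/to/install/dir -xzf ollama-linux-amd64.tgz
```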


@zora-wuw commented on GitHub (Aug 4, 2025):

> @ccebelenski
>
> Memory estimation is inaccurate when flash attention is enabled. #6160
>
> @zora-wuw
>
> ```
> time=2025-08-04T16:21:25.410+10:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/fred/oz334/IP_classifier/ollama_20250729/lib/ollama
> load_backend: loaded CPU backend from /fred/oz334/IP_classifier/ollama_20250729/lib/ollama/libggml-cpu-haswell.so
> time=2025-08-04T16:21:25.489+10:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
> ```
>
> No GPU backends found. How did you install ollama?

Hi @rick-github, thank you for the quick reply. Since I have no root permissions, I installed manually:

```
curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
tar -C /fred/oz334/ollama_20250729 -xzf ollama-linux-amd64.tgz
```
I started Ollama from a bash script:

```
module load cuda/12.8.0
OLLAMA_BIN=/fred/oz334/IP_classifier/ollama_20250729/bin/ollama
export OLLAMA_MODELS=/fred/oz334/IP_classifier/ollama_20250729/models
export OLLAMA_DEBUG=1
export OLLAMA_HOST="0.0.0.0:11434"
export OLLAMA_NEW_ENGINE=true
export OLLAMA_LLM_LIBRARY=cuda
export OLLAMA_LIBRARY_PATH="$TMPDIR:/fred/oz334/IP_classifier/ollama_20250729/lib/ollama"
export OLLAMA_LOAD_TIMEOUT=60m
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$CUDA_HOME/lib:$OLLAMA_LIBRARY_PATH:$LD_LIBRARY_PATH"
echo "-------------------------------"
echo "Starting Ollama…"
nohup $OLLAMA_BIN serve > /fred/oz334/IP_classifier/jobs/ollama_server.log 2>&1 &
```

I added `$TMPDIR` to `OLLAMA_LIBRARY_PATH` for testing.
The `lib/ollama` directory has the following contents:

```
total 1.5G
lrwxrwxrwx 1 zwu oz334   23 Jul  3 03:32 libcublasLt.so.12 -> libcublasLt.so.12.8.4.1
-rwxr-xr-x 1 zwu oz334 717M Jul  8  2015 libcublasLt.so.12.8.4.1
lrwxrwxrwx 1 zwu oz334   21 Jul  3 03:32 libcublas.so.12 -> libcublas.so.12.8.4.1
-rwxr-xr-x 1 zwu oz334 111M Jul  8  2015 libcublas.so.12.8.4.1
lrwxrwxrwx 1 zwu oz334   20 Jul  3 03:32 libcudart.so.12 -> libcudart.so.12.8.90
-rwxr-xr-x 1 zwu oz334 712K Jul  8  2015 libcudart.so.12.8.90
-rwxr-xr-x 1 zwu oz334 582K Jul  3 03:23 libggml-base.so
-rwxr-xr-x 1 zwu oz334 605K Jul  3 03:23 libggml-cpu-alderlake.so
-rwxr-xr-x 1 zwu oz334 605K Jul  3 03:23 libggml-cpu-haswell.so
-rwxr-xr-x 1 zwu oz334 709K Jul  3 03:23 libggml-cpu-icelake.so
-rwxr-xr-x 1 zwu oz334 593K Jul  3 03:23 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 zwu oz334 713K Jul  3 03:23 libggml-cpu-skylakex.so
-rwxr-xr-x 1 zwu oz334 469K Jul  3 03:23 libggml-cpu-sse42.so
-rwxr-xr-x 1 zwu oz334 465K Jul  3 03:23 libggml-cpu-x64.so
-rwxr-xr-x 1 zwu oz334 1.2G Jul  3 03:32 libggml-cuda.so
-rwxr-xr-x 1 zwu oz334 577M Jul  3 03:33 libggml-hip.so
```
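
A quick way to confirm which backends the runner actually loads (a small sketch, using the log path from the script above) is to grep the server log:

```
# Shows every loaded ggml backend and the GPU-compatibility decision
grep -E 'load_backend|compatible gpu libraries' /fred/oz334/IP_classifier/jobs/ollama_server.log
```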

@rick-github commented on GitHub (Aug 4, 2025):

You extracted the tar.gz file into `/fred/oz334/ollama_20250729` but are running the binary as `/fred/oz334/IP_classifier/ollama_20250729/bin/ollama`? Ollama finds the backends relative to the binary, so you need to arrange for `/fred/oz334/IP_classifier/ollama_20250729/lib` to point to `/fred/oz334/ollama_20250729/lib`.

Also don't set `OLLAMA_NEW_ENGINE`, `OLLAMA_LLM_LIBRARY`, `OLLAMA_LIBRARY_PATH`, or `LD_LIBRARY_PATH`.
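
For example, a single symlink would satisfy that layout (a minimal sketch, assuming the archive really was extracted into `/fred/oz334/ollama_20250729`):

```
# Expose the extracted lib directory next to the binary that is actually run
ln -s /fred/oz334/ollama_20250729/lib /fred/oz334/IP_classifier/ollama_20250729/lib
```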


@zora-wuw commented on GitHub (Aug 4, 2025):

@rick-github That might have been a typo in my earlier post for the extraction destination. To confirm, I re-extracted the tar directly into `/fred/oz334/IP_classifier/ollama_20250729` (after deleting the previous `bin` and `lib` folders), making sure `bin` and `lib` (and `models`) are in the same directory. As you suggested, I also removed the settings for `OLLAMA_NEW_ENGINE`, `OLLAMA_LLM_LIBRARY`, `OLLAMA_LIBRARY_PATH`, and `LD_LIBRARY_PATH`.

The log looks quite different, but it still shows `time=2025-08-04T20:51:03.466+10:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]`, and again only the CPU backend is loaded: `load_backend: loaded CPU backend from /fred/oz334/IP_classifier/ollama_20250729/lib/ollama/libggml-cpu-haswell.so`.

Full log below:

```
$ cat ollama_server.log
time=2025-08-04T20:50:52.653+10:00 level=INFO source=routes.go:1238 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL:0 HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY:cuda OLLAMA_LOAD_TIMEOUT:1h0m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/fred/oz334/IP_classifier/ollama_20250729/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:0 http_proxy: https_proxy: no_proxy:]"
time=2025-08-04T20:50:52.937+10:00 level=INFO source=images.go:476 msg="total blobs: 10"
time=2025-08-04T20:50:52.940+10:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-08-04T20:50:52.941+10:00 level=INFO source=routes.go:1291 msg="Listening on [::]:11434 (version 0.10.0)"
time=2025-08-04T20:50:52.941+10:00 level=DEBUG source=sched.go:106 msg="starting llm scheduler"
time=2025-08-04T20:50:52.941+10:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-04T20:50:52.945+10:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-08-04T20:50:52.945+10:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*
time=2025-08-04T20:50:52.945+10:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/fred/oz334/IP_classifier/ollama_20250729/lib/ollama/libcuda.so* /apps/modules/software/CUDA/12.8.0/nvvm/lib64/libcuda.so* /apps/modules/software/CUDA/12.8.0/extras/CUPTI/lib64/libcuda.so* /apps/modules/software/CUDA/12.8.0/targets/x86_64-linux/lib/libcuda.so* /apps/slurm/latest/lib/libcuda.so* /apps/slurm/latest/lib/slurm/libcuda.so* /opt/nvidia/latest/usr/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-08-04T20:50:52.976+10:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[/opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08]
initializing /opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08
dlsym: cuInit - 0x150a77e4c060
dlsym: cuDriverGetVersion - 0x150a77e4c080
dlsym: cuDeviceGetCount - 0x150a77e4c0c0
dlsym: cuDeviceGet - 0x150a77e4c0a0
dlsym: cuDeviceGetAttribute - 0x150a77e4c1a0
dlsym: cuDeviceGetUuid - 0x150a77e4c100
dlsym: cuDeviceGetName - 0x150a77e4c0e0
dlsym: cuCtxCreate_v3 - 0x150a77e4c380
dlsym: cuMemGetInfo_v2 - 0x150a77e4cb00
dlsym: cuCtxDestroy - 0x150a77eaaca0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-08-04T20:50:53.222+10:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=/opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08
[GPU-bf5e2ba4-b646-f625-f8d4-d2e42370245e] CUDA totalMem 81153mb
[GPU-bf5e2ba4-b646-f625-f8d4-d2e42370245e] CUDA freeMem 80729mb
[GPU-bf5e2ba4-b646-f625-f8d4-d2e42370245e] Compute Capability 8.0
time=2025-08-04T20:50:53.500+10:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2025-08-04T20:50:53.500+10:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-bf5e2ba4-b646-f625-f8d4-d2e42370245e library=cuda variant=v12 compute=8.0 driver=12.8 name="NVIDIA A100-SXM4-80GB" total="79.3 GiB" available="78.8 GiB"
[GIN] 2025/08/04 - 20:51:02 | 200 |      270.75µs |       127.0.0.1 | HEAD     "/"
time=2025-08-04T20:51:02.835+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
[GIN] 2025/08/04 - 20:51:02 | 200 |   240.07609ms |       127.0.0.1 | POST     "/api/show"
time=2025-08-04T20:51:02.881+10:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.5 GiB" before.free="453.7 GiB" before.free_swap="115.4 GiB" now.total="503.5 GiB" now.free="455.0 GiB" now.free_swap="115.4 GiB"
initializing /opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08
dlsym: cuInit - 0x150a77e4c060
dlsym: cuDriverGetVersion - 0x150a77e4c080
dlsym: cuDeviceGetCount - 0x150a77e4c0c0
dlsym: cuDeviceGet - 0x150a77e4c0a0
dlsym: cuDeviceGetAttribute - 0x150a77e4c1a0
dlsym: cuDeviceGetUuid - 0x150a77e4c100
dlsym: cuDeviceGetName - 0x150a77e4c0e0
dlsym: cuCtxCreate_v3 - 0x150a77e4c380
dlsym: cuMemGetInfo_v2 - 0x150a77e4cb00
dlsym: cuCtxDestroy - 0x150a77eaaca0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-08-04T20:51:03.146+10:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-bf5e2ba4-b646-f625-f8d4-d2e42370245e name="NVIDIA A100-SXM4-80GB" overhead="0 B" before.total="79.3 GiB" before.free="78.8 GiB" now.total="79.3 GiB" now.free="78.8 GiB" now.used="424.1 MiB"
releasing cuda driver library
time=2025-08-04T20:51:03.146+10:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-08-04T20:51:03.160+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-08-04T20:51:03.192+10:00 level=DEBUG source=sched.go:226 msg="loading first model" model=/fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312
time=2025-08-04T20:51:03.192+10:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[78.8 GiB]"
time=2025-08-04T20:51:03.192+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.vision.block_count default=0
time=2025-08-04T20:51:03.193+10:00 level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=/fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 gpu=GPU-bf5e2ba4-b646-f625-f8d4-d2e42370245e parallel=1 available=84651212800 required="21.5 GiB"
time=2025-08-04T20:51:03.193+10:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.5 GiB" before.free="455.0 GiB" before.free_swap="115.4 GiB" now.total="503.5 GiB" now.free="454.9 GiB" now.free_swap="115.4 GiB"
initializing /opt/nvidia/latest/usr/lib64/libcuda.so.570.148.08
dlsym: cuInit - 0x150a77e4c060
dlsym: cuDriverGetVersion - 0x150a77e4c080
dlsym: cuDeviceGetCount - 0x150a77e4c0c0
dlsym: cuDeviceGet - 0x150a77e4c0a0
dlsym: cuDeviceGetAttribute - 0x150a77e4c1a0
dlsym: cuDeviceGetUuid - 0x150a77e4c100
dlsym: cuDeviceGetName - 0x150a77e4c0e0
dlsym: cuCtxCreate_v3 - 0x150a77e4c380
dlsym: cuMemGetInfo_v2 - 0x150a77e4cb00
dlsym: cuCtxDestroy - 0x150a77eaaca0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-08-04T20:51:03.465+10:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-bf5e2ba4-b646-f625-f8d4-d2e42370245e name="NVIDIA A100-SXM4-80GB" overhead="0 B" before.total="79.3 GiB" before.free="78.8 GiB" now.total="79.3 GiB" now.free="78.8 GiB" now.used="424.1 MiB"
releasing cuda driver library
time=2025-08-04T20:51:03.465+10:00 level=INFO source=server.go:135 msg="system memory" total="503.5 GiB" free="454.9 GiB" free_swap="115.4 GiB"
time=2025-08-04T20:51:03.465+10:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[78.8 GiB]"
time=2025-08-04T20:51:03.465+10:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.vision.block_count default=0
time=2025-08-04T20:51:03.465+10:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[78.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.3 GiB"
time=2025-08-04T20:51:03.466+10:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
llama_model_loader: loaded meta data with 27 key-value pairs and 707 tensors from /fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 25600
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 64
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type  f16:   64 tensors
llama_model_loader: - type q4_K:  353 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.81 GiB (4.93 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-08-04T20:51:03.724+10:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/fred/oz334/IP_classifier/ollama_20250729/bin/ollama runner --model /fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 --ctx-size 4096 --batch-size 512 --n-gpu-layers 65 --threads 64 --parallel 1 --port 41963"
time=2025-08-04T20:51:03.724+10:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_MODELS=/fred/oz334/IP_classifier/ollama_20250729/models ROCR_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=GPU-bf5e2ba4-b646-f625-f8d4-d2e42370245e OLLAMA_HOST=0.0.0.0:11434 CUDA_PATH=/apps/modules/software/CUDA/12.8.0 CUDA_ROOT=/apps/modules/software/CUDA/12.8.0 GPU_DEVICE_ORDINAL=0 OLLAMA_LLM_LIBRARY=cuda OLLAMA_LOAD_TIMEOUT=60m OLLAMA_DEBUG=1 LD_LIBRARY_PATH=/fred/oz334/IP_classifier/ollama_20250729/lib/ollama:/apps/modules/software/CUDA/12.8.0/nvvm/lib64:/apps/modules/software/CUDA/12.8.0/extras/CUPTI/lib64:/apps/modules/software/CUDA/12.8.0/targets/x86_64-linux/lib:/apps/slurm/latest/lib:/apps/slurm/latest/lib/slurm:/opt/nvidia/latest/usr/lib64:/fred/oz334/IP_classifier/ollama_20250729/lib/ollama CUDA_HOME=/apps/modules/software/CUDA/12.8.0 PATH=/apps/system/software/apptainer/latest/bin:/apps/modules/software/CUDA/12.8.0/nvvm/bin:/apps/modules/software/CUDA/12.8.0/bin:/fred/oz334/IP_classifier/ollama_20250729/bin:/home/zwu/.nvm/versions/node/v22.17.1/bin:/apps/slurm/latest/sbin:/apps/slurm/latest/bin:/opt/nvidia/latest/usr/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/fred/oz334/ollama OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/fred/oz334/IP_classifier/ollama_20250729/lib/ollama
time=2025-08-04T20:51:03.725+10:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-04T20:51:03.725+10:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-04T20:51:03.726+10:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-04T20:51:03.744+10:00 level=INFO source=runner.go:815 msg="starting go runner"
time=2025-08-04T20:51:03.744+10:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/fred/oz334/IP_classifier/ollama_20250729/lib/ollama
load_backend: loaded CPU backend from /fred/oz334/IP_classifier/ollama_20250729/lib/ollama/libggml-cpu-haswell.so
time=2025-08-04T20:51:04.003+10:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-08-04T20:51:04.004+10:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:41963"
llama_model_loader: loaded meta data with 27 key-value pairs and 707 tensors from /fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 25600
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 64
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type  f16:   64 tensors
llama_model_loader: - type q4_K:  353 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.81 GiB (4.93 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 25600
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: layer  27 assigned to device CPU, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
load_tensors: layer  29 assigned to device CPU, is_swa = 0
load_tensors: layer  30 assigned to device CPU, is_swa = 0
load_tensors: layer  31 assigned to device CPU, is_swa = 0
load_tensors: layer  32 assigned to device CPU, is_swa = 0
load_tensors: layer  33 assigned to device CPU, is_swa = 0
load_tensors: layer  34 assigned to device CPU, is_swa = 0
load_tensors: layer  35 assigned to device CPU, is_swa = 0
load_tensors: layer  36 assigned to device CPU, is_swa = 0
load_tensors: layer  37 assigned to device CPU, is_swa = 0
load_tensors: layer  38 assigned to device CPU, is_swa = 0
load_tensors: layer  39 assigned to device CPU, is_swa = 0
load_tensors: layer  40 assigned to device CPU, is_swa = 0
load_tensors: layer  41 assigned to device CPU, is_swa = 0
load_tensors: layer  42 assigned to device CPU, is_swa = 0
load_tensors: layer  43 assigned to device CPU, is_swa = 0
load_tensors: layer  44 assigned to device CPU, is_swa = 0
load_tensors: layer  45 assigned to device CPU, is_swa = 0
load_tensors: layer  46 assigned to device CPU, is_swa = 0
load_tensors: layer  47 assigned to device CPU, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: layer  49 assigned to device CPU, is_swa = 0
load_tensors: layer  50 assigned to device CPU, is_swa = 0
load_tensors: layer  51 assigned to device CPU, is_swa = 0
load_tensors: layer  52 assigned to device CPU, is_swa = 0
load_tensors: layer  53 assigned to device CPU, is_swa = 0
load_tensors: layer  54 assigned to device CPU, is_swa = 0
load_tensors: layer  55 assigned to device CPU, is_swa = 0
load_tensors: layer  56 assigned to device CPU, is_swa = 0
load_tensors: layer  57 assigned to device CPU, is_swa = 0
load_tensors: layer  58 assigned to device CPU, is_swa = 0
load_tensors: layer  59 assigned to device CPU, is_swa = 0
load_tensors: layer  60 assigned to device CPU, is_swa = 0
load_tensors: layer  61 assigned to device CPU, is_swa = 0
load_tensors: layer  62 assigned to device CPU, is_swa = 0
load_tensors: layer  63 assigned to device CPU, is_swa = 0
load_tensors: layer  64 assigned to device CPU, is_swa = 0
time=2025-08-04T20:51:04.228+10:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
[GIN] 2025/08/04 - 20:51:12 | 200 |      24.506µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/04 - 20:51:12 | 200 |     215.636µs |       127.0.0.1 | GET      "/api/ps"
load_tensors:   CPU_Mapped model buffer size = 19259.71 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.60 MiB
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1, padding = 32
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: dev = CPU
llama_kv_cache_unified: layer  23: dev = CPU
llama_kv_cache_unified: layer  24: dev = CPU
llama_kv_cache_unified: layer  25: dev = CPU
llama_kv_cache_unified: layer  26: dev = CPU
llama_kv_cache_unified: layer  27: dev = CPU
llama_kv_cache_unified: layer  28: dev = CPU
llama_kv_cache_unified: layer  29: dev = CPU
llama_kv_cache_unified: layer  30: dev = CPU
llama_kv_cache_unified: layer  31: dev = CPU
llama_kv_cache_unified: layer  32: dev = CPU
llama_kv_cache_unified: layer  33: dev = CPU
llama_kv_cache_unified: layer  34: dev = CPU
llama_kv_cache_unified: layer  35: dev = CPU
llama_kv_cache_unified: layer  36: dev = CPU
llama_kv_cache_unified: layer  37: dev = CPU
llama_kv_cache_unified: layer  38: dev = CPU
llama_kv_cache_unified: layer  39: dev = CPU
llama_kv_cache_unified: layer  40: dev = CPU
llama_kv_cache_unified: layer  41: dev = CPU
llama_kv_cache_unified: layer  42: dev = CPU
llama_kv_cache_unified: layer  43: dev = CPU
llama_kv_cache_unified: layer  44: dev = CPU
llama_kv_cache_unified: layer  45: dev = CPU
llama_kv_cache_unified: layer  46: dev = CPU
llama_kv_cache_unified: layer  47: dev = CPU
llama_kv_cache_unified: layer  48: dev = CPU
llama_kv_cache_unified: layer  49: dev = CPU
llama_kv_cache_unified: layer  50: dev = CPU
llama_kv_cache_unified: layer  51: dev = CPU
llama_kv_cache_unified: layer  52: dev = CPU
llama_kv_cache_unified: layer  53: dev = CPU
llama_kv_cache_unified: layer  54: dev = CPU
llama_kv_cache_unified: layer  55: dev = CPU
llama_kv_cache_unified: layer  56: dev = CPU
llama_kv_cache_unified: layer  57: dev = CPU
llama_kv_cache_unified: layer  58: dev = CPU
llama_kv_cache_unified: layer  59: dev = CPU
llama_kv_cache_unified: layer  60: dev = CPU
llama_kv_cache_unified: layer  61: dev = CPU
llama_kv_cache_unified: layer  62: dev = CPU
llama_kv_cache_unified: layer  63: dev = CPU
time=2025-08-04T20:51:21.049+10:00 level=DEBUG source=server.go:643 msg="model load progress 1.00"
time=2025-08-04T20:51:21.300+10:00 level=DEBUG source=server.go:646 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_unified:        CPU KV buffer size =  1024.00 MiB
llama_kv_cache_unified: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:        CPU compute buffer size =   572.01 MiB
llama_context: graph nodes  = 2438
llama_context: graph splits = 1
time=2025-08-04T20:51:21.550+10:00 level=INFO source=server.go:637 msg="llama runner started in 17.83 seconds"
time=2025-08-04T20:51:21.551+10:00 level=DEBUG source=sched.go:493 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=1 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=1 runner.pid=3661685 runner.model=/fred/oz334/IP_classifier/ollama_20250729/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=4096
time=2025-08-04T20:51:21.551+10:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=58 format=""
time=2025-08-04T20:51:21.553+10:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=11 used=0 remaining=11
Since it can find the CPU backend, why can't it find `libggml-cuda.so`? lol
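A quick way to check, assuming the directory layout from the log above, is to list which ggml backend libraries are actually present; on a CUDA-capable bundle, `libggml-cuda*.so` should sit alongside the CPU variants:

```bash
# Directory taken from the "ggml backend load all from path" log line;
# a CUDA build of the runner ships libggml-cuda*.so next to the
# libggml-cpu-* variants that were loaded here.
ls /fred/oz334/IP_classifier/ollama_20250729/lib/ollama/libggml-*
```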

@rick-github commented on GitHub (Aug 4, 2025):

The log is hard to read; please fix the markdown tags.


@rick-github commented on GitHub (Aug 4, 2025):

It looks like this is running as a Slurm job. The Slurm scheduler sets both `ROCR_VISIBLE_DEVICES` and `CUDA_VISIBLE_DEVICES`, which confuses the runner about which device to use. Unset `ROCR_VISIBLE_DEVICES` (`unset ROCR_VISIBLE_DEVICES`) in the start script, or set `Flags=nvidia_gpu_env` in your `gres.conf`.
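
A minimal sketch of the job-script variant of the fix (the `#SBATCH` line is illustrative; the ollama path is taken from the log above):

```bash
#!/bin/bash
#SBATCH --gres=gpu:1   # illustrative; any GPU allocation can trigger the issue
# Slurm's gres plugin exports both CUDA_VISIBLE_DEVICES and
# ROCR_VISIBLE_DEVICES for the allocated device; clearing the ROCm
# variable leaves only the NVIDIA device mask for the ollama runner.
unset ROCR_VISIBLE_DEVICES
/fred/oz334/IP_classifier/ollama_20250729/bin/ollama serve
```

The `gres.conf` alternative (`Flags=nvidia_gpu_env`) achieves the same thing cluster-wide, by telling Slurm to export only the NVIDIA environment variable in the first place.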


@zora-wuw commented on GitHub (Aug 4, 2025):

That was it, `unset ROCR_VISIBLE_DEVICES` did the job! I really appreciate your help!
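
For anyone landing here later, a quick way to confirm the fix took effect once a model is loaded (standard ollama CLI; output columns may vary by version):

```bash
# With ROCR_VISIBLE_DEVICES unset, the PROCESSOR column for the loaded
# model should read "100% GPU" instead of "100% CPU".
ollama ps
```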
