[GH-ISSUE #11519] ollama fails to find conda installed cuda #54119

Closed
opened 2026-04-29 05:14:43 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @skwde on GitHub (Jul 25, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11519

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I am trying to get the latest ollama working with our A100, but it fails to use my conda-installed CUDA.

Currently we have NVIDIA driver 535.161.07, which supports CUDA versions up to 12.2.2.

See the following nvidia-smi output:

$ nvidia-smi
Fri Jul 25 07:13:26 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:17:00.0 Off |                    0 |
| N/A   31C    P0              35W / 250W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The latest ollama ships with CUDA 12.9 (or thereabouts), so I need to install my own CUDA.
I installed CUDA 12.2.2 in a conda environment from the nvidia/label/cuda-12.2.2 channel. When the environment is activated I get

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
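
(For reference, an environment like this is typically created along the following lines; the environment name cuda-12.2 matches the conda paths in the log below, while installing the cuda meta-package is an assumption: only the channel label comes from the text above.)

# assumed creation steps for a conda CUDA 12.2.2 environment
conda create -n cuda-12.2 -c nvidia/label/cuda-12.2.2 cuda
conda activate cuda-12.2
nvcc --version   # should report release 12.2, V12.2.140 as above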

When I then start ollama with

./bin/ollama serve

it always gives

time=2025-07-25T07:20:50.603+02:00 level=WARN source=cuda_common.go:65 msg="old CUDA driver detected - please upgrade to a newer driver" version=0.0

and

time=2025-07-25T07:20:50.707+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-30482f99-0f88-5514-0dea-d2f901ad513d library=cuda variant=v11 compute=8.0 driver=0.0 name="" total="39.4 GiB" available="39.0 GiB"

no matter what I do.
When I try to run

./bin/ollama run deepseek-r1:32b "hello"

it takes forever to load and prints messages about assigning layers to the CPU.
The same happens with a smaller model.
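
(One way to confirm where a loaded model actually ended up is to check from a second terminal while the request above is running; ollama ps and nvidia-smi are standard commands, nothing here is specific to this setup.)

./bin/ollama ps   # PROCESSOR column shows the CPU/GPU split for the loaded model
nvidia-smi        # a model running purely on the CPU leaves the GPU process list empty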

I tried various things:

  • Setting env variables

    # Path is already set when activating the conda env
    export LD_LIBRARY_PATH=$CONDA_PREFIX/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    export CUDA_PATH=$CONDA_PREFIX
    export CUDA_HOME=$CUDA_PATH
    
  • removing the CUDA libs in ./lib/ollama and linking to the corresponding ones in $CONDA_PREFIX/lib, without success

  • starting ollama while OLLAMA_LLM_LIBRARY=cuda_v12 is set, roughly as shown below (though I am not sure whether cuda_v12 is correct, because nothing is mentioned in the server log, contrary to what is described here: https://ollama.qubitpi.org/troubleshooting/#llm-libraries)
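
    (Exact invocation assumed; cuda_v12 matches the cuda_v* directory pattern visible in the library search globs in the log below.)

    # override the runner library only for this server start (value assumed)
    OLLAMA_LLM_LIBRARY=cuda_v12 ./bin/ollama serve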

Based on the docs (https://ollama.qubitpi.org/gpu/#nvidia) my system and driver are supported, so what am I missing here?
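
(For reference: the DEBUG output below shows the libcuda.so search returning no paths. One quick sanity check is whether the driver library itself is visible anywhere; libcuda.so ships with the NVIDIA driver rather than with a CUDA toolkit or conda package, and the locations below are only common examples.)

ldconfig -p | grep libcuda                # is the driver library known to the dynamic linker?
ls -l /usr/lib/x86_64-linux-gnu/libcuda.so* /usr/lib64/libcuda.so* 2>/dev/null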

Relevant log output

time=2025-07-25T07:23:35.062+02:00 level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL:0 HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:<ollama install dir>/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:0 http_proxy: https_proxy: no_proxy:]"
time=2025-07-25T07:23:35.069+02:00 level=INFO source=images.go:476 msg="total blobs: 10"
time=2025-07-25T07:23:35.071+02:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-07-25T07:23:35.072+02:00 level=INFO source=routes.go:1288 msg="Listening on 127.0.0.1:11434 (version 0.9.6)"
time=2025-07-25T07:23:35.073+02:00 level=DEBUG source=sched.go:108 msg="starting llm scheduler"
time=2025-07-25T07:23:35.074+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-07-25T07:23:35.088+02:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-07-25T07:23:35.088+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*
time=2025-07-25T07:23:35.088+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[<ollama install dir>/lib/ollama/libcuda.so* <base>/.conda/envs/cuda-12.2/lib/libcuda.so* <other1>libcuda.so* <other2>/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-07-25T07:23:35.091+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[]
time=2025-07-25T07:23:35.091+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcudart.so*
time=2025-07-25T07:23:35.091+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[<ollama install dir>/lib/ollama/libcudart.so* <base>/.conda/envs/cuda-12.2/lib/libcudart.so* <other1>libcudart.so* <other2>/libcudart.so* <ollama install dir>/lib/ollama/cuda_v*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2025-07-25T07:23:35.097+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[<base>/.conda/envs/cuda-12.2/lib/libcudart.so.12.2.140]
CUDA driver version: 12-2
time=2025-07-25T07:23:35.397+02:00 level=DEBUG source=gpu.go:140 msg="detected GPUs" library=<base>/.conda/envs/cuda-12.2/lib/libcudart.so.12.2.140 count=1
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA totalMem 42298834944
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA freeMem 41855287296
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA usedMem 0
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] Compute Capability 8.0
time=2025-07-25T07:23:35.399+02:00 level=WARN source=cuda_common.go:65 msg="old CUDA driver detected - please upgrade to a newer driver" version=0.0
time=2025-07-25T07:23:35.399+02:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cudart library
time=2025-07-25T07:23:35.527+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-30482f99-0f88-5514-0dea-d2f901ad513d library=cuda variant=v11 compute=8.0 driver=0.0 name="" total="39.4 GiB" available="39.0 GiB"
[GIN] 2025/07/25 - 07:23:40 | 200 |    1.148083ms |       127.0.0.1 | HEAD     "/"
time=2025-07-25T07:23:41.002+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
[GIN] 2025/07/25 - 07:23:41 | 200 |   80.210208ms |       127.0.0.1 | POST     "/api/show"
time=2025-07-25T07:23:41.054+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.3 GiB" before.free="494.0 GiB" before.free_swap="31.9 GiB" now.total="503.3 GiB" now.free="493.9 GiB" now.free_swap="31.9 GiB"
CUDA driver version: 12-2
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA totalMem 42298834944
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA freeMem 41855287296
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA usedMem 0
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] Compute Capability 8.0
time=2025-07-25T07:23:41.203+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-30482f99-0f88-5514-0dea-d2f901ad513d name="" overhead="0 B" before.total="39.4 GiB" before.free="39.0 GiB" now.total="39.4 GiB" now.free="39.0 GiB" now.used="0 B"
releasing cudart library
time=2025-07-25T07:23:41.280+02:00 level=DEBUG source=sched.go:185 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-07-25T07:23:41.300+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-25T07:23:41.341+02:00 level=DEBUG source=sched.go:228 msg="loading first model" model=<ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93
time=2025-07-25T07:23:41.342+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[39.0 GiB]"
time=2025-07-25T07:23:41.342+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen2.vision.block_count default=0
time=2025-07-25T07:23:41.342+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen2.attention.key_length default=128
time=2025-07-25T07:23:41.342+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen2.attention.value_length default=128
time=2025-07-25T07:23:41.343+02:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=<ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 gpu=GPU-30482f99-0f88-5514-0dea-d2f901ad513d parallel=2 available=41855287296 required="21.5 GiB"
time=2025-07-25T07:23:41.343+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.3 GiB" before.free="493.9 GiB" before.free_swap="31.9 GiB" now.total="503.3 GiB" now.free="493.9 GiB" now.free_swap="31.9 GiB"
CUDA driver version: 12-2
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA totalMem 42298834944
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA freeMem 41855287296
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA usedMem 0
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] Compute Capability 8.0
time=2025-07-25T07:23:41.476+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-30482f99-0f88-5514-0dea-d2f901ad513d name="" overhead="0 B" before.total="39.4 GiB" before.free="39.0 GiB" now.total="39.4 GiB" now.free="39.0 GiB" now.used="0 B"
releasing cudart library
time=2025-07-25T07:23:41.552+02:00 level=INFO source=server.go:135 msg="system memory" total="503.3 GiB" free="493.9 GiB" free_swap="31.9 GiB"
time=2025-07-25T07:23:41.552+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[39.0 GiB]"
time=2025-07-25T07:23:41.552+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen2.vision.block_count default=0
time=2025-07-25T07:23:41.553+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen2.attention.key_length default=128
time=2025-07-25T07:23:41.553+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen2.attention.value_length default=128
time=2025-07-25T07:23:41.553+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[39.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="18.1 GiB" memory.weights.repeating="17.5 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-07-25T07:23:41.553+02:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
llama_model_loader: loaded meta data with 26 key-value pairs and 771 tensors from <ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.48 GiB (4.85 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 32.76 B
print_info: general.name     = DeepSeek R1 Distill Qwen 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-07-25T07:23:41.781+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="<ollama install dir>/bin/ollama runner --model <ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 --ctx-size 8192 --batch-size 512 --n-gpu-layers 65 --threads 64 --parallel 2 --port 36915"
time=2025-07-25T07:23:41.781+02:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_DEBUG=1 LD_LIBRARY_PATH=<ollama install dir>/lib/ollama:<base>/.conda/envs/cuda-12.2/lib:<other2>:<ollama install dir>/lib/ollama CUDA_PATH=<base>/.conda/envs/cuda-12.2 ROCR_VISIBLE_DEVICES=0 CUDA_HOME=<base>/.conda/envs/cuda-12.2 CUDA_VISIBLE_DEVICES=GPU-30482f99-0f88-5514-0dea-d2f901ad513d PATH=<base>/.conda/envs/cuda-12.2/bin:<other>/sbin:<other>/bin:<HOME>/.vscode-server/cli/servers/Stable-7adae6a56e34cb64d08899664b814cf620465925/server/bin/remote-cli:/usr/local/lsfm/bin:/net/shared/lsfm/common/admin/internal/bin:/net/shared/lsfm/common/admin/external/bin:/net/shared/lsfm/common/admin/external/opt/mambaforge/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin OLLAMA_MODELS=<ollama install dir>/models GPU_DEVICE_ORDINAL=0 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=<ollama install dir>/lib/ollama
time=2025-07-25T07:23:41.789+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-25T07:23:41.801+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-25T07:23:41.805+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-25T07:23:41.852+02:00 level=INFO source=runner.go:815 msg="starting go runner"
time=2025-07-25T07:23:41.854+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=<ollama install dir>/lib/ollama
load_backend: loaded CPU backend from <ollama install dir>/lib/ollama/libggml-cpu-icelake.so
time=2025-07-25T07:23:41.905+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-07-25T07:23:41.906+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:36915"
llama_model_loader: loaded meta data with 26 key-value pairs and 771 tensors from <ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.48 GiB (4.85 BPW)
init_tokenizer: initializing tokenizer for type 2
time=2025-07-25T07:23:42.059+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 27648
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = DeepSeek R1 Distill Qwen 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: layer  27 assigned to device CPU, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
load_tensors: layer  29 assigned to device CPU, is_swa = 0
load_tensors: layer  30 assigned to device CPU, is_swa = 0
load_tensors: layer  31 assigned to device CPU, is_swa = 0
load_tensors: layer  32 assigned to device CPU, is_swa = 0
load_tensors: layer  33 assigned to device CPU, is_swa = 0
load_tensors: layer  34 assigned to device CPU, is_swa = 0
load_tensors: layer  35 assigned to device CPU, is_swa = 0
load_tensors: layer  36 assigned to device CPU, is_swa = 0
load_tensors: layer  37 assigned to device CPU, is_swa = 0
load_tensors: layer  38 assigned to device CPU, is_swa = 0
load_tensors: layer  39 assigned to device CPU, is_swa = 0
load_tensors: layer  40 assigned to device CPU, is_swa = 0
load_tensors: layer  41 assigned to device CPU, is_swa = 0
load_tensors: layer  42 assigned to device CPU, is_swa = 0
load_tensors: layer  43 assigned to device CPU, is_swa = 0
load_tensors: layer  44 assigned to device CPU, is_swa = 0
load_tensors: layer  45 assigned to device CPU, is_swa = 0
load_tensors: layer  46 assigned to device CPU, is_swa = 0
load_tensors: layer  47 assigned to device CPU, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: layer  49 assigned to device CPU, is_swa = 0
load_tensors: layer  50 assigned to device CPU, is_swa = 0
load_tensors: layer  51 assigned to device CPU, is_swa = 0
load_tensors: layer  52 assigned to device CPU, is_swa = 0
load_tensors: layer  53 assigned to device CPU, is_swa = 0
load_tensors: layer  54 assigned to device CPU, is_swa = 0
load_tensors: layer  55 assigned to device CPU, is_swa = 0
load_tensors: layer  56 assigned to device CPU, is_swa = 0
load_tensors: layer  57 assigned to device CPU, is_swa = 0
load_tensors: layer  58 assigned to device CPU, is_swa = 0
load_tensors: layer  59 assigned to device CPU, is_swa = 0
load_tensors: layer  60 assigned to device CPU, is_swa = 0
load_tensors: layer  61 assigned to device CPU, is_swa = 0
load_tensors: layer  62 assigned to device CPU, is_swa = 0
load_tensors: layer  63 assigned to device CPU, is_swa = 0
load_tensors: layer  64 assigned to device CPU, is_swa = 0
load_tensors:   CPU_Mapped model buffer size = 18926.01 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 2
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     1.20 MiB
create_memory: n_ctx = 8192 (padded)
llama_kv_cache_unified: kv_size = 8192, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1, padding = 32
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: dev = CPU
llama_kv_cache_unified: layer  23: dev = CPU
llama_kv_cache_unified: layer  24: dev = CPU
llama_kv_cache_unified: layer  25: dev = CPU
llama_kv_cache_unified: layer  26: dev = CPU
llama_kv_cache_unified: layer  27: dev = CPU
llama_kv_cache_unified: layer  28: dev = CPU
llama_kv_cache_unified: layer  29: dev = CPU
llama_kv_cache_unified: layer  30: dev = CPU
llama_kv_cache_unified: layer  31: dev = CPU
llama_kv_cache_unified: layer  32: dev = CPU
llama_kv_cache_unified: layer  33: dev = CPU
llama_kv_cache_unified: layer  34: dev = CPU
llama_kv_cache_unified: layer  35: dev = CPU
llama_kv_cache_unified: layer  36: dev = CPU
llama_kv_cache_unified: layer  37: dev = CPU
llama_kv_cache_unified: layer  38: dev = CPU
llama_kv_cache_unified: layer  39: dev = CPU
llama_kv_cache_unified: layer  40: dev = CPU
llama_kv_cache_unified: layer  41: dev = CPU
llama_kv_cache_unified: layer  42: dev = CPU
llama_kv_cache_unified: layer  43: dev = CPU
llama_kv_cache_unified: layer  44: dev = CPU
llama_kv_cache_unified: layer  45: dev = CPU
llama_kv_cache_unified: layer  46: dev = CPU
llama_kv_cache_unified: layer  47: dev = CPU
llama_kv_cache_unified: layer  48: dev = CPU
llama_kv_cache_unified: layer  49: dev = CPU
llama_kv_cache_unified: layer  50: dev = CPU
llama_kv_cache_unified: layer  51: dev = CPU
llama_kv_cache_unified: layer  52: dev = CPU
llama_kv_cache_unified: layer  53: dev = CPU
llama_kv_cache_unified: layer  54: dev = CPU
llama_kv_cache_unified: layer  55: dev = CPU
llama_kv_cache_unified: layer  56: dev = CPU
llama_kv_cache_unified: layer  57: dev = CPU
llama_kv_cache_unified: layer  58: dev = CPU
llama_kv_cache_unified: layer  59: dev = CPU
llama_kv_cache_unified: layer  60: dev = CPU
llama_kv_cache_unified: layer  61: dev = CPU
llama_kv_cache_unified: layer  62: dev = CPU
llama_kv_cache_unified: layer  63: dev = CPU
time=2025-07-25T07:24:01.599+02:00 level=DEBUG source=server.go:643 msg="model load progress 1.00"
time=2025-07-25T07:24:01.849+02:00 level=DEBUG source=server.go:646 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_unified:        CPU KV buffer size =  2048.00 MiB
llama_kv_cache_unified: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:        CPU compute buffer size =   696.01 MiB
llama_context: graph nodes  = 2374
llama_context: graph splits = 1
time=2025-07-25T07:24:02.100+02:00 level=INFO source=server.go:637 msg="llama runner started in 20.30 seconds"
time=2025-07-25T07:24:02.100+02:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/deepseek-r1:32b runner.inference=cuda runner.devices=1 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=2 runner.pid=1390103 runner.model=<ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 runner.num_ctx=8192
time=2025-07-25T07:24:02.101+02:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=34 format=""
time=2025-07-25T07:24:02.103+02:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=5 used=0 remaining=5
time=2025-07-25T07:29:30.280+02:00 level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-07-25T07:29:30.281+02:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/deepseek-r1:32b runner.inference=cuda runner.devices=1 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=2 runner.pid=1390103 runner.model=<ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 runner.num_ctx=8192 duration=5m0s
time=2025-07-25T07:29:30.281+02:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/deepseek-r1:32b runner.inference=cuda runner.devices=1 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=2 runner.pid=1390103 runner.model=<ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 runner.num_ctx=8192 refCount=0
time=2025-07-25T07:29:30.281+02:00 level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:36915/completion\": context canceled"
[GIN] 2025/07/25 - 07:29:30 | 200 |         5m49s |       127.0.0.1 | POST     "/api/generate"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

ollama version is 0.9.6

CPU, is_swa = 0 load_tensors: layer 45 assigned to device CPU, is_swa = 0 load_tensors: layer 46 assigned to device CPU, is_swa = 0 load_tensors: layer 47 assigned to device CPU, is_swa = 0 load_tensors: layer 48 assigned to device CPU, is_swa = 0 load_tensors: layer 49 assigned to device CPU, is_swa = 0 load_tensors: layer 50 assigned to device CPU, is_swa = 0 load_tensors: layer 51 assigned to device CPU, is_swa = 0 load_tensors: layer 52 assigned to device CPU, is_swa = 0 load_tensors: layer 53 assigned to device CPU, is_swa = 0 load_tensors: layer 54 assigned to device CPU, is_swa = 0 load_tensors: layer 55 assigned to device CPU, is_swa = 0 load_tensors: layer 56 assigned to device CPU, is_swa = 0 load_tensors: layer 57 assigned to device CPU, is_swa = 0 load_tensors: layer 58 assigned to device CPU, is_swa = 0 load_tensors: layer 59 assigned to device CPU, is_swa = 0 load_tensors: layer 60 assigned to device CPU, is_swa = 0 load_tensors: layer 61 assigned to device CPU, is_swa = 0 load_tensors: layer 62 assigned to device CPU, is_swa = 0 load_tensors: layer 63 assigned to device CPU, is_swa = 0 load_tensors: layer 64 assigned to device CPU, is_swa = 0 load_tensors: CPU_Mapped model buffer size = 18926.01 MiB llama_context: constructing llama_context llama_context: n_seq_max = 2 llama_context: n_ctx = 8192 llama_context: n_ctx_per_seq = 4096 llama_context: n_batch = 1024 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = 0 llama_context: freq_base = 1000000.0 llama_context: freq_scale = 1 llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized set_abort_callback: call llama_context: CPU output buffer size = 1.20 MiB create_memory: n_ctx = 8192 (padded) llama_kv_cache_unified: kv_size = 8192, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1, padding = 32 llama_kv_cache_unified: layer 0: dev = CPU llama_kv_cache_unified: layer 1: dev = CPU llama_kv_cache_unified: layer 2: dev = CPU llama_kv_cache_unified: layer 3: dev = CPU llama_kv_cache_unified: layer 4: dev = CPU llama_kv_cache_unified: layer 5: dev = CPU llama_kv_cache_unified: layer 6: dev = CPU llama_kv_cache_unified: layer 7: dev = CPU llama_kv_cache_unified: layer 8: dev = CPU llama_kv_cache_unified: layer 9: dev = CPU llama_kv_cache_unified: layer 10: dev = CPU llama_kv_cache_unified: layer 11: dev = CPU llama_kv_cache_unified: layer 12: dev = CPU llama_kv_cache_unified: layer 13: dev = CPU llama_kv_cache_unified: layer 14: dev = CPU llama_kv_cache_unified: layer 15: dev = CPU llama_kv_cache_unified: layer 16: dev = CPU llama_kv_cache_unified: layer 17: dev = CPU llama_kv_cache_unified: layer 18: dev = CPU llama_kv_cache_unified: layer 19: dev = CPU llama_kv_cache_unified: layer 20: dev = CPU llama_kv_cache_unified: layer 21: dev = CPU llama_kv_cache_unified: layer 22: dev = CPU llama_kv_cache_unified: layer 23: dev = CPU llama_kv_cache_unified: layer 24: dev = CPU llama_kv_cache_unified: layer 25: dev = CPU llama_kv_cache_unified: layer 26: dev = CPU llama_kv_cache_unified: layer 27: dev = CPU llama_kv_cache_unified: layer 28: dev = CPU llama_kv_cache_unified: layer 29: dev = CPU llama_kv_cache_unified: layer 30: dev = CPU llama_kv_cache_unified: layer 31: dev = CPU llama_kv_cache_unified: layer 32: dev = CPU llama_kv_cache_unified: layer 33: dev = CPU llama_kv_cache_unified: layer 34: dev = CPU llama_kv_cache_unified: layer 35: dev = CPU llama_kv_cache_unified: layer 36: dev = CPU llama_kv_cache_unified: layer 37: dev = 
CPU llama_kv_cache_unified: layer 38: dev = CPU llama_kv_cache_unified: layer 39: dev = CPU llama_kv_cache_unified: layer 40: dev = CPU llama_kv_cache_unified: layer 41: dev = CPU llama_kv_cache_unified: layer 42: dev = CPU llama_kv_cache_unified: layer 43: dev = CPU llama_kv_cache_unified: layer 44: dev = CPU llama_kv_cache_unified: layer 45: dev = CPU llama_kv_cache_unified: layer 46: dev = CPU llama_kv_cache_unified: layer 47: dev = CPU llama_kv_cache_unified: layer 48: dev = CPU llama_kv_cache_unified: layer 49: dev = CPU llama_kv_cache_unified: layer 50: dev = CPU llama_kv_cache_unified: layer 51: dev = CPU llama_kv_cache_unified: layer 52: dev = CPU llama_kv_cache_unified: layer 53: dev = CPU llama_kv_cache_unified: layer 54: dev = CPU llama_kv_cache_unified: layer 55: dev = CPU llama_kv_cache_unified: layer 56: dev = CPU llama_kv_cache_unified: layer 57: dev = CPU llama_kv_cache_unified: layer 58: dev = CPU llama_kv_cache_unified: layer 59: dev = CPU llama_kv_cache_unified: layer 60: dev = CPU llama_kv_cache_unified: layer 61: dev = CPU llama_kv_cache_unified: layer 62: dev = CPU llama_kv_cache_unified: layer 63: dev = CPU time=2025-07-25T07:24:01.599+02:00 level=DEBUG source=server.go:643 msg="model load progress 1.00" time=2025-07-25T07:24:01.849+02:00 level=DEBUG source=server.go:646 msg="model load completed, waiting for server to become available" status="llm server loading model" llama_kv_cache_unified: CPU KV buffer size = 2048.00 MiB llama_kv_cache_unified: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB llama_context: enumerating backends llama_context: backend_ptrs.size() = 1 llama_context: max_nodes = 65536 llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0 llama_context: reserving graph for n_tokens = 512, n_seqs = 1 llama_context: reserving graph for n_tokens = 1, n_seqs = 1 llama_context: reserving graph for n_tokens = 512, n_seqs = 1 llama_context: CPU compute buffer size = 696.01 MiB llama_context: graph nodes = 2374 llama_context: graph splits = 1 time=2025-07-25T07:24:02.100+02:00 level=INFO source=server.go:637 msg="llama runner started in 20.30 seconds" time=2025-07-25T07:24:02.100+02:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/deepseek-r1:32b runner.inference=cuda runner.devices=1 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=2 runner.pid=1390103 runner.model=<ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 runner.num_ctx=8192 time=2025-07-25T07:24:02.101+02:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=34 format="" time=2025-07-25T07:24:02.103+02:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=5 used=0 remaining=5 time=2025-07-25T07:29:30.280+02:00 level=DEBUG source=sched.go:503 msg="context for request finished" time=2025-07-25T07:29:30.281+02:00 level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/deepseek-r1:32b runner.inference=cuda runner.devices=1 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=2 runner.pid=1390103 runner.model=<ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 runner.num_ctx=8192 duration=5m0s time=2025-07-25T07:29:30.281+02:00 level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/deepseek-r1:32b 
runner.inference=cuda runner.devices=1 runner.size="21.5 GiB" runner.vram="21.5 GiB" runner.parallel=2 runner.pid=1390103 runner.model=<ollama install dir>/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 runner.num_ctx=8192 refCount=0
time=2025-07-25T07:29:30.281+02:00 level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:36915/completion\": context canceled"
[GIN] 2025/07/25 - 07:29:30 | 200 | 5m49s | 127.0.0.1 | POST "/api/generate"
```

### OS
Linux
### GPU
Nvidia
### CPU
AMD
### Ollama version
ollama version is 0.9.6
GiteaMirror added the bug, nvidia labels 2026-04-29 05:14:43 -05:00

@rick-github commented on GitHub (Jul 27, 2025):

> The latest ollama ships with CUDA 12.9 something
> Consequently I need to install my own CUDA.

You shouldn't need to. I run 0.9.6 with 12.2 drivers:

nvidia-smi | grep Driver
| NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.2     |

I notice that your logs show the libcudart libraries being loaded from the conda environment. Try using the system libcuda libraries instead.
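
For reference, a quick way to check which CUDA libraries are actually visible on a machine (a generic sketch; the directories below are common defaults, not paths from this setup):

```console
# libraries the dynamic linker already knows about
ldconfig -p | grep -E 'libcuda(rt)?\.so'

# common driver library locations
find /usr/lib /usr/lib64 /usr/lib/x86_64-linux-gnu -name 'libcuda.so*' 2>/dev/null
```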


@skwde commented on GitHub (Jul 28, 2025):

Thanks for your reply, @rick-github. I am not sure what you mean by "system libcuda libraries".
I am in an HPC environment where there are no "system libcuda libraries". We merely installed the driver (and leave it to the user to install a fitting CUDA toolkit).

I reinstalled the latest version of ollama (to remove my test with conda installed CUDA 12.2).

Below is a picture of the lib/ollama contents.

![lib/ollama contents](https://github.com/user-attachments/assets/a3543fb4-bb52-41bd-bc60-0b4f9e60d311)

I get 12.8, which, according to Table 3 in https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions, is not supported by the driver we have.

You can find the full log below.
First, I want to highlight some things I find odd:

...
time=2025-07-28T05:58:47.777+02:00 level=WARN source=cuda_common.go:65 msg="old CUDA driver detected - please upgrade to a newer driver" version=0.0
...
id=GPU-30482f99-0f88-5514-0dea-d2f901ad513d library=cuda variant=v11 compute=8.0 driver=0.0 name="" total="39.4 GiB" available="39.0 GiB"
...
time=2025-07-28T05:58:53.572+02:00 level=INFO source=ggml.go:375 msg="offloaded 0/27 layers to GPU"
...
time=2025-07-28T05:58:53.621+02:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="68.2 MiB"

So I interpret these as "ollama does see the GPU, but it is not happy with the driver, so it falls back to the CPU".
Then it gets stuck for some reason.
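
One way to make the suspected driver/runtime mismatch explicit is to compare what the driver reports with what ships in lib/ollama (a minimal sketch; `<ollama install dir>` stands for whatever install path is used, as elsewhere in this log):

```console
# installed NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA runtime libraries bundled with the ollama release
ls <ollama install dir>/lib/ollama | grep -i cudart
```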

Below the full log.

time=2025-07-28T05:58:46.080+02:00 level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL:0 HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:<ollama_base>/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:0 http_proxy: https_proxy: no_proxy:]"
time=2025-07-28T05:58:47.381+02:00 level=INFO source=images.go:476 msg="total blobs: 12"
time=2025-07-28T05:58:47.382+02:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-07-28T05:58:47.383+02:00 level=INFO source=routes.go:1288 msg="Listening on 127.0.0.1:11434 (version 0.9.6)"
time=2025-07-28T05:58:47.384+02:00 level=DEBUG source=sched.go:108 msg="starting llm scheduler"
time=2025-07-28T05:58:47.384+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-07-28T05:58:47.398+02:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-07-28T05:58:47.398+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*
time=2025-07-28T05:58:47.398+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[<ollama_base>/lib/ollama/libcuda.so* <ohter1>/libcuda.so* <ohter2>/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-07-28T05:58:47.400+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[]
time=2025-07-28T05:58:47.400+02:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcudart.so*
time=2025-07-28T05:58:47.400+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[<ollama_base>/lib/ollama/libcudart.so* <ohter1>/libcudart.so* <ohter2>/lib64/libcudart.so* <ollama_base>/lib/ollama/cuda_v*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2025-07-28T05:58:47.402+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[<ollama_base>/lib/ollama/libcudart.so.12.8.90]
CUDA driver version: 12-2
time=2025-07-28T05:58:47.775+02:00 level=DEBUG source=gpu.go:140 msg="detected GPUs" library=<ollama_base>/lib/ollama/libcudart.so.12.8.90 count=1
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA totalMem 42298834944
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA freeMem 41855287296
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA usedMem 0
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] Compute Capability 8.0
time=2025-07-28T05:58:47.777+02:00 level=WARN source=cuda_common.go:65 msg="old CUDA driver detected - please upgrade to a newer driver" version=0.0
time=2025-07-28T05:58:47.777+02:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cudart library
time=2025-07-28T05:58:47.875+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-30482f99-0f88-5514-0dea-d2f901ad513d library=cuda variant=v11 compute=8.0 driver=0.0 name="" total="39.4 GiB" available="39.0 GiB"
[GIN] 2025/07/28 - 05:58:51 | 200 |     700.669µs |       127.0.0.1 | HEAD     "/"
time=2025-07-28T05:58:52.025+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
[GIN] 2025/07/28 - 05:58:52 | 200 |  402.480311ms |       127.0.0.1 | POST     "/api/show"
time=2025-07-28T05:58:52.142+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.3 GiB" before.free="494.0 GiB" before.free_swap="31.9 GiB" now.total="503.3 GiB" now.free="493.9 GiB" now.free_swap="31.9 GiB"
CUDA driver version: 12-2
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA totalMem 42298834944
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA freeMem 41855287296
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA usedMem 0
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] Compute Capability 8.0
time=2025-07-28T05:58:52.346+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-30482f99-0f88-5514-0dea-d2f901ad513d name="" overhead="0 B" before.total="39.4 GiB" before.free="39.0 GiB" now.total="39.4 GiB" now.free="39.0 GiB" now.used="0 B"
releasing cudart library
time=2025-07-28T05:58:52.460+02:00 level=DEBUG source=sched.go:185 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-07-28T05:58:52.507+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-28T05:58:52.617+02:00 level=DEBUG source=sched.go:228 msg="loading first model" model=<ollama_base>/models/blobs/sha256-7cd4618c1faf8b7233c6c906dac1694b6a47684b37b8895d470ac688520b9c01
time=2025-07-28T05:58:52.618+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[39.0 GiB]"
time=2025-07-28T05:58:52.618+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.block_count default=0
time=2025-07-28T05:58:52.618+02:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=<ollama_base>/models/blobs/sha256-7cd4618c1faf8b7233c6c906dac1694b6a47684b37b8895d470ac688520b9c01 gpu=GPU-30482f99-0f88-5514-0dea-d2f901ad513d parallel=2 available=41855287296 required="1.8 GiB"
time=2025-07-28T05:58:52.618+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="503.3 GiB" before.free="493.9 GiB" before.free_swap="31.9 GiB" now.total="503.3 GiB" now.free="493.9 GiB" now.free_swap="31.9 GiB"
CUDA driver version: 12-2
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA totalMem 42298834944
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA freeMem 41855287296
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] CUDA usedMem 0
[GPU-30482f99-0f88-5514-0dea-d2f901ad513d] Compute Capability 8.0
time=2025-07-28T05:58:52.767+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-30482f99-0f88-5514-0dea-d2f901ad513d name="" overhead="0 B" before.total="39.4 GiB" before.free="39.0 GiB" now.total="39.4 GiB" now.free="39.0 GiB" now.used="0 B"
releasing cudart library
time=2025-07-28T05:58:52.853+02:00 level=INFO source=server.go:135 msg="system memory" total="503.3 GiB" free="493.9 GiB" free_swap="31.9 GiB"
time=2025-07-28T05:58:52.853+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[39.0 GiB]"
time=2025-07-28T05:58:52.853+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.block_count default=0
time=2025-07-28T05:58:52.854+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[39.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.8 GiB" memory.required.partial="1.8 GiB" memory.required.kv="65.0 MiB" memory.required.allocations="[1.8 GiB]" memory.weights.total="762.5 MiB" memory.weights.repeating="456.5 MiB" memory.weights.nonrepeating="306.0 MiB" memory.graph.full="514.2 MiB" memory.graph.partial="750.5 MiB"
time=2025-07-28T05:58:52.854+02:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
time=2025-07-28T05:58:52.938+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-28T05:58:52.938+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2025-07-28T05:58:52.938+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.image_size default=0
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.patch_size default=0
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.num_channels default=0
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.block_count default=0
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.embedding_length default=0
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.attention.head_count default=0
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.image_size default=0
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.patch_size default=0
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.attention.layer_norm_epsilon default=0
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.rope.freq_scale default=1
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256
time=2025-07-28T05:58:52.941+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="<ollama_base>/bin/ollama runner --ollama-engine --model <ollama_base>/models/blobs/sha256-7cd4618c1faf8b7233c6c906dac1694b6a47684b37b8895d470ac688520b9c01 --ctx-size 8192 --batch-size 512 --n-gpu-layers 27 --threads 64 --parallel 2 --port 33515"
time=2025-07-28T05:58:52.941+02:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_DEBUG=1 LD_LIBRARY_PATH=<ollama_base>/lib/ollama:<ohter1>:<ohter2>/lib64:<ollama_base>/lib/ollama ROCR_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=GPU-30482f99-0f88-5514-0dea-d2f901ad513d PATH=<ohter2>/sbin:<ohter2>/bin:<home>/.vscode-server/cli/servers/Stable-7adae6a56e34cb64d08899664b814cf620465925/server/bin/remote-cli:/usr/local/lsfm/bin:/net/shared/lsfm/common/admin/internal/bin:/net/shared/lsfm/common/admin/external/bin:/net/shared/lsfm/common/admin/external/opt/mambaforge/condabin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:<home>/.vscode-server/extensions/ms-python.debugpy-2025.10.0/bundled/scripts/noConfigScripts OLLAMA_MODELS=<ollama_base>/models GPU_DEVICE_ORDINAL=0 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=<ollama_base>/lib/ollama
time=2025-07-28T05:58:52.947+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-28T05:58:52.953+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-28T05:58:52.977+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-28T05:58:52.993+02:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-07-28T05:58:52.994+02:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:33515"
time=2025-07-28T05:58:53.085+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-28T05:58:53.085+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.name default=""
time=2025-07-28T05:58:53.085+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.description default=""
time=2025-07-28T05:58:53.085+02:00 level=INFO source=ggml.go:92 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=340 num_key_values=32
time=2025-07-28T05:58:53.085+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=<ollama_base>/lib/ollama
time=2025-07-28T05:58:53.244+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
load_backend: loaded CPU backend from <ollama_base>/lib/ollama/libggml-cpu-icelake.so
time=2025-07-28T05:58:53.571+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-07-28T05:58:53.572+02:00 level=INFO source=ggml.go:359 msg="offloading 26 repeating layers to GPU"
time=2025-07-28T05:58:53.572+02:00 level=INFO source=ggml.go:363 msg="offloading output layer to CPU"
time=2025-07-28T05:58:53.572+02:00 level=INFO source=ggml.go:375 msg="offloaded 0/27 layers to GPU"
time=2025-07-28T05:58:53.572+02:00 level=INFO source=ggml.go:377 msg="model weights" buffer=CPU size="1.0 GiB"
time=2025-07-28T05:58:53.573+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2025-07-28T05:58:53.573+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.image_size default=0
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.patch_size default=0
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.num_channels default=0
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.block_count default=0
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.embedding_length default=0
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.attention.head_count default=0
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.image_size default=0
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.patch_size default=0
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.vision.attention.layer_norm_epsilon default=0
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.rope.freq_scale default=1
time=2025-07-28T05:58:53.575+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256
time=2025-07-28T05:58:53.621+02:00 level=DEBUG source=ggml.go:648 msg="compute graph" nodes=1151 splits=1
time=2025-07-28T05:58:53.621+02:00 level=INFO source=ggml.go:666 msg="compute graph" backend=CPU buffer_type=CPU size="68.2 MiB"
time=2025-07-28T05:58:53.621+02:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=320864256A allocated.CPU.Weights="[19491584A 19491584A 19491584A 17328128A 17328128A 19491584A 17328128A 17328128A 19491584A 17328128A 17328128A 19491584A 17328128A 17328128A 19491584A 17328128A 17328128A 19491584A 17328128A 17328128A 19491584A 17328128A 19491584A 19491584A 19491584A 19491584A 320868864A]" allocated.CPU.Cache="[1572864A 1572864A 1572864A 1572864A 1572864A 8388608A 1572864A 1572864A 1572864A 1572864A 1572864A 8388608A 1572864A 1572864A 1572864A 1572864A 1572864A 8388608A 1572864A 1572864A 1572864A 1572864A 1572864A 8388608A 1572864A 1572864A 0U]" allocated.CPU.Graph=71565312A
time=2025-07-28T05:58:53.745+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.14"
time=2025-07-28T05:58:53.996+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.41"
time=2025-07-28T05:58:54.246+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.67"
time=2025-07-28T05:58:54.497+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.91"
time=2025-07-28T05:58:54.747+02:00 level=INFO source=server.go:637 msg="llama runner started in 1.80 seconds"
time=2025-07-28T05:58:54.747+02:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/gemma3:1b runner.inference=cuda runner.devices=1 runner.size="1.8 GiB" runner.vram="1.8 GiB" runner.parallel=2 runner.pid=2145171 runner.model=<ollama_base>/models/blobs/sha256-7cd4618c1faf8b7233c6c906dac1694b6a47684b37b8895d470ac688520b9c01 runner.num_ctx=8192
time=2025-07-28T05:58:54.748+02:00 level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=60 format=""
time=2025-07-28T05:58:54.762+02:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[2]
time=2025-07-28T05:58:54.762+02:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=10 used=0 remaining=10


@skwde commented on GitHub (Jul 28, 2025):

A short update: I simply left the hanging shell open, and I now see that output was generated after 1h 50m.

See the log (I aborted the query of above log and started a new one):

time=2025-07-28T06:18:20.266+02:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=17 used=0 remaining=17
[GIN] 2025/07/28 - 08:08:41 | 200 |      1h50m23s |       127.0.0.1 | POST     "/api/generate"

It seems something is not working properly...
I ran `gemma3:1b` on an A100 with the prompt `Hello there, do I get an answer`.
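
To get a reproducible timing for such a request, the API can be called directly from the shell (a minimal sketch; the model and prompt are simply the ones used above):

```console
time curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "gemma3:1b", "prompt": "Hello there, do I get an answer", "stream": false}'
```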


@rick-github commented on GitHub (Jul 28, 2025):

On my A100, I installed the system CUDA libraries as suggested [here](https://github.com/ollama/ollama/blob/main/docs/linux.md#install-cuda-drivers-optional); when ollama starts, it loads libraries from /usr/lib:

$ sudo journalctl -u ollama --no-pager | grep libcuda
Jul 28 06:38:05 a100-40g ollama[8080]: time=2025-07-28T06:38:05.240Z level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*
Jul 28 06:38:05 a100-40g ollama[8080]: time=2025-07-28T06:38:05.241Z level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
Jul 28 06:38:05 a100-40g ollama[8080]: time=2025-07-28T06:38:05.243Z level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[/usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07 /usr/lib32/libcuda.so.550.90.07]"
Jul 28 06:38:05 a100-40g ollama[8080]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07
Jul 28 06:38:05 a100-40g ollama[8080]: time=2025-07-28T06:38:05.344Z level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07
$ ollama run gemma3:1b --verbose Hello there, do I get an answer
Hello! Yes, absolutely. 😊 

You can answer whatever you’d like.  Do you have any questions you’d like to ask, or would you like me to just chat?

total duration:       2.613416161s
load duration:        2.122548095s
prompt eval count:    17 token(s)
prompt eval duration: 215.20211ms
prompt eval rate:     79.00 tokens/s
eval count:           41 token(s)
eval duration:        274.484211ms
eval rate:            149.37 tokens/s
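
For a tarball install like the one in this issue (no systemd unit to query with journalctl), a hedged equivalent of the check above is to run the server in the foreground with debug logging and filter for the same discovery messages; `OLLAMA_DEBUG` is documented, the grep pattern is just an assumption about which lines matter:

```console
# Run the server in the foreground with debug logging and show only the
# GPU library discovery lines (the same messages the journal shows above).
$ OLLAMA_DEBUG=1 ./bin/ollama serve 2>&1 | grep -E "Searching for GPU library|gpu library search|discovered GPU libraries"
```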

@skwde commented on GitHub (Jul 28, 2025):

Ok, yes that's what I thought.

How do I point `ollama` at a non-standard CUDA path? CUDA is not under the default path `/usr/lib`.

As mentioned, setting variables like `PATH` / `LD_LIBRARY_PATH` / `CUDA_HOME` / `CUDA_PATH`, or adding links to `lib/ollama`, does not work.
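
A hedged sketch of how one might narrow this down (paths and outcomes are assumptions, not a confirmed fix): first check whether the conda environment even contains a `libcuda.so` for the glob search to find, since the toolkit packages generally ship `libcudart`/`libcublas` while `libcuda.so` itself normally comes from the NVIDIA driver install, then relaunch with debug logging to see which paths the search actually picks up.

```console
# What CUDA libraries does the conda env actually provide?
$ ls "$CONDA_PREFIX"/lib/libcud* 2>/dev/null
$ find "$CONDA_PREFIX" -name 'libcuda.so*' 2>/dev/null

# Relaunch with the env's lib dir on LD_LIBRARY_PATH and debug logging, and check
# which paths the glob search reports (see the "gpu library search" globs above).
$ LD_LIBRARY_PATH="$CONDA_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
  OLLAMA_DEBUG=1 ./bin/ollama serve 2>&1 | grep -E "gpu library search|discovered GPU libraries"
```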


@skwde commented on GitHub (Jul 29, 2025):

@rick-github, in addition to my last comment: I tested it again and I am confused by the following output:

time=2025-07-29T10:02:29.804+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[<ollama base>/lib/ollama/libcudart.so.12.8.90]
CUDA driver version: 12-2
time=2025-07-29T10:02:30.437+02:00 level=DEBUG source=gpu.go:140 msg="detected GPUs" library=<ollama base>/lib/ollama/libcudart.so.12.8.90 count=4
[GPU-913e6980-b33a-cf46-6c27-1009e419ba11] CUDA totalMem 42298834944
[GPU-913e6980-b33a-cf46-6c27-1009e419ba11] CUDA freeMem 41855287296
[GPU-913e6980-b33a-cf46-6c27-1009e419ba11] CUDA usedMem 0
[GPU-913e6980-b33a-cf46-6c27-1009e419ba11] Compute Capability 8.0
time=2025-07-29T10:02:30.634+02:00 level=WARN source=cuda_common.go:65 msg="old CUDA driver detected - please upgrade to a newer driver" version=0.0
time=2025-07-29T10:02:31.012+02:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cudart library
time=2025-07-29T10:02:31.486+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-913e6980-b33a-cf46-6c27-1009e419ba11 library=cuda variant=v11 compute=8.0 driver=0.0 name="" total="39.4 GiB" available="39.0 GiB"

It uses the `ollama`-provided CUDA libs, right?
So based on https://github.com/ollama/ollama/blob/3515cc377ce2506c95a0ea408fd5d15d306fc6aa/discover/cuda_common.go#L62-L68 it should not print this message?
Or what CUDA is `ollama` actually using?
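
One hedged cross-check for that confusing `driver=0.0` (a suggestion, not from the thread): ask the driver directly what it reports. If these show 535.161.07 / CUDA 12.2 while ollama logs `driver=0.0`, the warning is likely firing because the version lookup failed, not because the driver is actually too old.

```console
# What the installed NVIDIA driver reports about itself:
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
$ cat /proc/driver/nvidia/version
```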

Reference: github-starred/ollama#54119