[GH-ISSUE #13163] Ollama 0.12.11 Not Using GPU on RTX 5070 Ti (Blackwell/CC 12.0) #8704

Closed
opened 2026-04-12 21:28:37 -05:00 by GiteaMirror · 13 comments

Originally created by @deparko on GitHub (Nov 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13163

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Description

Ollama 0.12.11 fails to detect and use the GPU for local models on NVIDIA GeForce RTX 5070 Ti (Blackwell architecture, Compute Capability 12.0). The GPU is functional and accessible, but Ollama immediately falls back to CPU-only mode without error messages.

Critical: This worked before November 17, 2025, indicating a regression or compatibility issue with Blackwell architecture.

Environment

  • OS: Ubuntu 25.04 (GNU/Linux 6.14.0-35-generic x86_64)
  • GPU: NVIDIA GeForce RTX 5070 Ti (16GB VRAM)
  • GPU Compute Capability: 12.0 (Blackwell architecture)
  • GPU Driver: 580.95.05
  • CUDA Runtime: 12.2.140
  • Ollama Version: 0.12.11 (latest, clean install)
  • Installation Method: Standalone binary via systemd service

Steps to Reproduce

  1. Install Ollama 0.12.11 on system with RTX 5070 Ti
  2. Configure minimal systemd override:
    [Service]
    Environment=OLLAMA_MODELS=/mnt/shared/ollama-models/models
    Environment=CUDA_VISIBLE_DEVICES=0
    
  3. Start Ollama service: sudo systemctl start ollama.service
  4. Load a model: ollama run llama3.1:8b
  5. Check GPU usage: ollama ps or curl http://localhost:11434/api/ps
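
One way to apply the step 2 override and restart (a sketch; systemctl edit opens a drop-in override.conf for the unit, and the service name assumes the standard Linux install):

$ sudo systemctl edit ollama.service
    [Service]
    Environment=OLLAMA_MODELS=/mnt/shared/ollama-models/models
    Environment=CUDA_VISIBLE_DEVICES=0
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama.service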

Expected Behavior

  • Ollama should detect GPU and initialize CUDA backend
  • Models should offload layers to GPU
  • ollama ps should show non-zero size_vram
  • Logs should show: ggml_cuda_init: found 1 CUDA devices and load_backend: loaded CUDA backend
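
A quick way to confirm whether these lines show up after a restart (a sketch; the journalctl invocation matches the systemd install used in this report, and the PROCESSOR column of ollama ps is how recent builds report CPU vs. GPU placement):

$ sudo systemctl restart ollama.service
$ journalctl -u ollama.service -b --no-pager | grep -E 'ggml_cuda_init|load_backend|inference compute'
$ ollama ps    # PROCESSOR should read GPU (not CPU) when layers are offloaded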

Actual Behavior

  • Ollama discovers GPU but immediately falls back to CPU
  • All models show size_vram: 0 MB
  • Logs show:
    msg="discovering available GPUs..."
    msg="inference compute" id=cpu library=cpu
    msg="entering low vram mode" "total vram"="0 B"
    
  • No error messages (silent fallback)
  • Models run on CPU (slow performance: ~60+ seconds for simple queries)

Evidence It Previously Worked

Logs from November 17, 2025 (when GPU was working):

ggml_cuda_init: found 1 CUDA devices
load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
device=GPU for model weights and KV cache
offloaded 41/41 layers to GPU

After system reboot on November 18, 2025: GPU detection stopped working.

Troubleshooting Attempted

  1. Set environment variables (OLLAMA_NUM_GPU=1, CUDA_VISIBLE_DEVICES=0)
  2. Reinstalled Ollama binary (v0.12.11 from GitHub releases)
  3. Manual CUDA library path configuration (LD_LIBRARY_PATH)
  4. Created symlinks for CUDA libraries
  5. Clean install: Complete removal of all Ollama files/configs + fresh install
  6. Minimal configuration (removed all manual overrides, let Ollama auto-discover)

Result: All attempts show identical behavior - GPU discovery runs but immediately falls back to CPU within ~13ms.
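
For anyone triaging, a few checks that separate the driver/CUDA stack from Ollama's own discovery (a sketch; the /usr/local/lib/ollama path comes from the working log above, and the device-node names are the standard NVIDIA defaults):

$ nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv
$ ldconfig -p | grep -E 'libcuda\.so|libcudart|libnvidia-ml'
$ ls /usr/local/lib/ollama/      # bundled backends, e.g. cuda_v12/ and cuda_v13/
$ ls -l /dev/nvidia*             # a missing nvidia-uvm node after a reboot can break CUDA init even though nvidia-smi works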

GPU Verification

GPU is functional and accessible:

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   40C    P8             18W /  300W |   13051MiB /  16303MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+

Other services successfully use GPU:

  • RAG service uses GPU for embeddings/reranking (NomicEmbedder + BGE Reranker on CUDA)
  • gnome-shell uses GPU for graphics

Ollama Service Logs

$ journalctl -u ollama.service --since '10 seconds ago' | grep -i -E 'gpu|cuda|vram|compute|discovering'
Nov 18 22:49:19 Tatami ollama[31412]: msg="discovering available GPUs..."
Nov 18 22:49:19 Tatami ollama[31412]: msg="inference compute" id=cpu library=cpu
Nov 18 22:49:19 Tatami ollama[31412]: msg="entering low vram mode" "total vram"="0 B"

Model Status

$ curl -s http://localhost:11434/api/ps | jq
[
  {
    "name": "llama3.1:8b",
    "model": "llama3.1:8b",
    "size": 4630000000,
    "size_vram": 0,  # <-- Should be non-zero
    "context_length": 4096
  }
]

Hypothesis

Ollama 0.12.11 may not support Compute Capability 12.0 (Blackwell architecture) yet.

The RTX 5070 Ti is very new hardware, and Ollama's bundled CUDA runners may not include kernels compiled for CC 12.0. When CUDA backend initialization fails, Ollama gracefully falls back to CPU without error messages.
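
One way to test this hypothesis is to list which architectures the bundled CUDA backend was actually built for (a sketch; the library path comes from the working log above, and cuobjdump is only available where the CUDA toolkit is installed):

$ ls /usr/local/lib/ollama/
$ cuobjdump --list-elf /usr/local/lib/ollama/cuda_v13/libggml-cuda.so | grep -o 'sm_[0-9]*' | sort -u
    # embedded cubins are named by architecture, e.g. sm_86, sm_90; CC 12.0 would appear as sm_120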

Questions

  1. Does Ollama 0.12.11 support Compute Capability 12.0 (Blackwell)?
  2. Are there any debug flags to get more verbose CUDA initialization logs?
  3. Is there a known issue or workaround for RTX 50-series GPUs?
  4. Should I try rolling back to an older Ollama version that worked before Nov 17?
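
On question 4, the official install script accepts a version pin, so a rollback is straightforward to test (a sketch; 0.12.10 is just an example of an earlier build, not a confirmed-good version for Blackwell):

$ curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.10 sh
$ ollama -v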

Additional Context

  • Models tested: llama3.1:8b, qwen3:14b, qwen:14b - all show same behavior
  • Cloud models: Work fine (authenticated with Ollama Cloud)
  • Service configuration: Minimal systemd override (no manual library paths)
  • This may be related to CUDA compute capability support
  • Similar issues may exist for other RTX 50-series GPUs (Blackwell architecture)

Relevant log output

### Ollama Service Logs


Nov 19 13:20:52 Tatami systemd[1]: Started ollama.service - Ollama Service.
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.536-08:00 level=INFO source=routes.go:1544 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/mnt/shared/ollama-models/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.539-08:00 level=INFO source=images.go:522 msg="total blobs: 73"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.539-08:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.540-08:00 level=INFO source=routes.go:1597 msg="Listening on [::]:11434 (version 0.12.11)"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.540-08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.540-08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 35681"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.555-08:00 level=INFO source=types.go:60 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="122.6 GiB" available="109.1 GiB"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.555-08:00 level=INFO source=routes.go:1638 msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
Nov 19 13:20:52 Tatami systemd[1]: Started ollama.service - Ollama Service.
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.536-08:00 level=INFO source=routes.go:1544 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/mnt/shared/ollama-models/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.539-08:00 level=INFO source=images.go:522 msg="total blobs: 73"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.539-08:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.540-08:00 level=INFO source=routes.go:1597 msg="Listening on [::]:11434 (version 0.12.11)"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.540-08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.540-08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 35681"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.555-08:00 level=INFO source=types.go:60 msg="inference compute" id=cpu library=cpu compute="" name=cpu description=cpu libdirs=ollama driver="" pci_id="" type="" total="122.6 GiB" available="109.1 GiB"
Nov 19 13:20:52 Tatami ollama[84614]: time=2025-11-19T13:20:52.555-08:00 level=INFO source=routes.go:1638 msg="entering low vram mode" "total vram"="0 B" threshold="20.0 GiB"
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /mnt/shared/ollama-models/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 (version GGUF V3 (latest))
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   1:                               general.type str              = model
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   5:                         general.size_label str              = 8B
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   9:                          llama.block_count u32              = 32
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  17:                          general.file_type u32              = 15
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - type  f32:   66 tensors
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - type q4_K:  193 tensors
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - type q6_K:   33 tensors
Nov 19 13:20:58 Tatami ollama[84614]: print_info: file format = GGUF V3 (latest)
Nov 19 13:20:58 Tatami ollama[84614]: print_info: file type   = Q4_K - Medium
Nov 19 13:20:58 Tatami ollama[84614]: print_info: file size   = 4.58 GiB (4.89 BPW)
Nov 19 13:20:58 Tatami ollama[84614]: load: printing all EOG tokens:
Nov 19 13:20:58 Tatami ollama[84614]: load:   - 128001 ('<|end_of_text|>')
Nov 19 13:20:58 Tatami ollama[84614]: load:   - 128008 ('<|eom_id|>')
Nov 19 13:20:58 Tatami ollama[84614]: load:   - 128009 ('<|eot_id|>')
Nov 19 13:20:58 Tatami ollama[84614]: load: special tokens cache size = 256
Nov 19 13:20:58 Tatami ollama[84614]: load: token to piece cache size = 0.7999 MB
Nov 19 13:20:58 Tatami ollama[84614]: print_info: arch             = llama
Nov 19 13:20:58 Tatami ollama[84614]: print_info: vocab_only       = 1
Nov 19 13:20:58 Tatami ollama[84614]: print_info: model type       = ?B
Nov 19 13:20:58 Tatami ollama[84614]: print_info: model params     = 8.03 B
Nov 19 13:20:58 Tatami ollama[84614]: print_info: general.name     = Meta Llama 3.1 8B Instruct
Nov 19 13:20:58 Tatami ollama[84614]: print_info: vocab type       = BPE
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_vocab          = 128256
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_merges         = 280147
Nov 19 13:20:58 Tatami ollama[84614]: print_info: BOS token        = 128000 '<|begin_of_text|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOS token        = 128009 '<|eot_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOT token        = 128009 '<|eot_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOM token        = 128008 '<|eom_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: LF token         = 198 'Ċ'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOG token        = 128001 '<|end_of_text|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOG token        = 128008 '<|eom_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOG token        = 128009 '<|eot_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: max token length = 256
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_load: vocab only - skipping tensors
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.394-08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /mnt/shared/ollama-models/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 --port 41617"
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.394-08:00 level=INFO source=sched.go:443 msg="system memory" total="122.6 GiB" free="109.1 GiB" free_swap="8.0 GiB"
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.394-08:00 level=INFO source=server.go:459 msg="loading model" "model layers"=33 requested=-1
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.395-08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="4.3 GiB"
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.395-08:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="512.0 MiB"
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.395-08:00 level=INFO source=device.go:272 msg="total memory" size="4.8 GiB"
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.403-08:00 level=INFO source=runner.go:963 msg="starting go runner"
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.403-08:00 level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.404-08:00 level=INFO source=runner.go:999 msg="Server listening on 127.0.0.1:41617"
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.406-08:00 level=INFO source=runner.go:893 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:8 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.406-08:00 level=INFO source=server.go:1294 msg="waiting for llama runner to start responding"
Nov 19 13:20:58 Tatami ollama[84614]: time=2025-11-19T13:20:58.406-08:00 level=INFO source=server.go:1328 msg="waiting for server to become available" status="llm server loading model"
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /mnt/shared/ollama-models/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 (version GGUF V3 (latest))
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   1:                               general.type str              = model
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   5:                         general.size_label str              = 8B
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv   9:                          llama.block_count u32              = 32
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  17:                          general.file_type u32              = 15
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - type  f32:   66 tensors
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - type q4_K:  193 tensors
Nov 19 13:20:58 Tatami ollama[84614]: llama_model_loader: - type q6_K:   33 tensors
Nov 19 13:20:58 Tatami ollama[84614]: print_info: file format = GGUF V3 (latest)
Nov 19 13:20:58 Tatami ollama[84614]: print_info: file type   = Q4_K - Medium
Nov 19 13:20:58 Tatami ollama[84614]: print_info: file size   = 4.58 GiB (4.89 BPW)
Nov 19 13:20:58 Tatami ollama[84614]: load: printing all EOG tokens:
Nov 19 13:20:58 Tatami ollama[84614]: load:   - 128001 ('<|end_of_text|>')
Nov 19 13:20:58 Tatami ollama[84614]: load:   - 128008 ('<|eom_id|>')
Nov 19 13:20:58 Tatami ollama[84614]: load:   - 128009 ('<|eot_id|>')
Nov 19 13:20:58 Tatami ollama[84614]: load: special tokens cache size = 256
Nov 19 13:20:58 Tatami ollama[84614]: load: token to piece cache size = 0.7999 MB
Nov 19 13:20:58 Tatami ollama[84614]: print_info: arch             = llama
Nov 19 13:20:58 Tatami ollama[84614]: print_info: vocab_only       = 0
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_ctx_train      = 131072
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_embd           = 4096
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_layer          = 32
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_head           = 32
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_head_kv        = 8
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_rot            = 128
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_swa            = 0
Nov 19 13:20:58 Tatami ollama[84614]: print_info: is_swa_any       = 0
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_embd_head_k    = 128
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_embd_head_v    = 128
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_gqa            = 4
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_embd_k_gqa     = 1024
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_embd_v_gqa     = 1024
Nov 19 13:20:58 Tatami ollama[84614]: print_info: f_norm_eps       = 0.0e+00
Nov 19 13:20:58 Tatami ollama[84614]: print_info: f_norm_rms_eps   = 1.0e-05
Nov 19 13:20:58 Tatami ollama[84614]: print_info: f_clamp_kqv      = 0.0e+00
Nov 19 13:20:58 Tatami ollama[84614]: print_info: f_max_alibi_bias = 0.0e+00
Nov 19 13:20:58 Tatami ollama[84614]: print_info: f_logit_scale    = 0.0e+00
Nov 19 13:20:58 Tatami ollama[84614]: print_info: f_attn_scale     = 0.0e+00
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_ff             = 14336
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_expert         = 0
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_expert_used    = 0
Nov 19 13:20:58 Tatami ollama[84614]: print_info: causal attn      = 1
Nov 19 13:20:58 Tatami ollama[84614]: print_info: pooling type     = 0
Nov 19 13:20:58 Tatami ollama[84614]: print_info: rope type        = 0
Nov 19 13:20:58 Tatami ollama[84614]: print_info: rope scaling     = linear
Nov 19 13:20:58 Tatami ollama[84614]: print_info: freq_base_train  = 500000.0
Nov 19 13:20:58 Tatami ollama[84614]: print_info: freq_scale_train = 1
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_ctx_orig_yarn  = 131072
Nov 19 13:20:58 Tatami ollama[84614]: print_info: rope_finetuned   = unknown
Nov 19 13:20:58 Tatami ollama[84614]: print_info: model type       = 8B
Nov 19 13:20:58 Tatami ollama[84614]: print_info: model params     = 8.03 B
Nov 19 13:20:58 Tatami ollama[84614]: print_info: general.name     = Meta Llama 3.1 8B Instruct
Nov 19 13:20:58 Tatami ollama[84614]: print_info: vocab type       = BPE
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_vocab          = 128256
Nov 19 13:20:58 Tatami ollama[84614]: print_info: n_merges         = 280147
Nov 19 13:20:58 Tatami ollama[84614]: print_info: BOS token        = 128000 '<|begin_of_text|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOS token        = 128009 '<|eot_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOT token        = 128009 '<|eot_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOM token        = 128008 '<|eom_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: LF token         = 198 'Ċ'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOG token        = 128001 '<|end_of_text|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOG token        = 128008 '<|eom_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: EOG token        = 128009 '<|eot_id|>'
Nov 19 13:20:58 Tatami ollama[84614]: print_info: max token length = 256
Nov 19 13:20:58 Tatami ollama[84614]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Nov 19 13:20:58 Tatami ollama[84614]: load_tensors:          CPU model buffer size =  4685.30 MiB
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: constructing llama_context
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: n_seq_max     = 1
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: n_ctx         = 4096
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: n_ctx_per_seq = 4096
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: n_batch       = 512
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: n_ubatch      = 512
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: causal_attn   = 1
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: flash_attn    = disabled
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: kv_unified    = false
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: freq_base     = 500000.0
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: freq_scale    = 1
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Nov 19 13:21:00 Tatami ollama[84614]: llama_context:        CPU  output buffer size =     0.50 MiB
Nov 19 13:21:00 Tatami ollama[84614]: llama_kv_cache:        CPU KV buffer size =   512.00 MiB
Nov 19 13:21:00 Tatami ollama[84614]: llama_kv_cache: size =  512.00 MiB (  4096 cells,  32 layers,  1/1 seqs), K (f16):  256.00 MiB, V (f16):  256.00 MiB
Nov 19 13:21:00 Tatami ollama[84614]: llama_context:        CPU compute buffer size =   300.01 MiB
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: graph nodes  = 1158
Nov 19 13:21:00 Tatami ollama[84614]: llama_context: graph splits = 1
Nov 19 13:21:00 Tatami ollama[84614]: time=2025-11-19T13:21:00.412-08:00 level=INFO source=server.go:1332 msg="llama runner started in 2.02 seconds"
Nov 19 13:21:00 Tatami ollama[84614]: time=2025-11-19T13:21:00.412-08:00 level=INFO source=sched.go:517 msg="loaded runners" count=1
Nov 19 13:21:00 Tatami ollama[84614]: time=2025-11-19T13:21:00.412-08:00 level=INFO source=server.go:1294 msg="waiting for llama runner to start responding"
Nov 19 13:21:00 Tatami ollama[84614]: time=2025-11-19T13:21:00.412-08:00 level=INFO source=server.go:1332 msg="llama runner started in 2.02 seconds"
Nov 19 13:21:21 Tatami ollama[84614]: [GIN] 2025/11/19 - 13:21:21 | 200 | 23.341049668s |             ::1 | POST     "/api/generate"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.12.11

GiteaMirror added the nvidia and bug labels 2026-04-12 21:28:37 -05:00

@jessegross commented on GitHub (Nov 19, 2025):

Can you please post the full log?


@deparko commented on GitHub (Nov 19, 2025):

just recreated the problem and added log above


@rick-github commented on GitHub (Nov 19, 2025):

Set `OLLAMA_DEBUG=2` to log more information about device detection.


@deparko commented on GitHub (Nov 19, 2025):

[ollama_gpu_failure_logs_debug2.txt](https://github.com/user-attachments/files/23638029/ollama_gpu_failure_logs_debug2.txt)


@rick-github commented on GitHub (Nov 19, 2025):

Set `OLLAMA_DEBUG=2` in the environment of the server to log more information about device detection.
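
For a systemd-managed install like the one described above, one way to get the variable into the server's environment is a drop-in override; a minimal sketch, assuming the unit is named `ollama.service` as in the original report:

```bash
# open (or create) a drop-in override for the service
sudo systemctl edit ollama.service
# add these two lines to the override, then save and exit:
#   [Service]
#   Environment=OLLAMA_DEBUG=2
sudo systemctl restart ollama.service
# follow the server log while it starts up and probes for GPUs
journalctl -u ollama.service -f
```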


@Mustardsauce commented on GitHub (Nov 20, 2025):

I'm running Ollama 0.12.10 on Docker with a B200 (Blackwell) GPU, but it fails to utilize the GPU. After enabling `OLLAMA_DEBUG=2`, I'm getting a `ggml_cuda_init: failed to initialize CUDA: initialization error` in the logs.
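
For the Docker case, a common first check when `ggml_cuda_init` reports an initialization error is whether the container actually has GPU access through the NVIDIA Container Toolkit; a rough sketch, assuming the toolkit is installed on the host:

```bash
# confirm the NVIDIA runtime can expose the GPU to an arbitrary container
docker run --rm --gpus=all ubuntu nvidia-smi

# start ollama with GPU access (flags per the ollama Docker instructions)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama
```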


@dhiltgen commented on GitHub (Nov 21, 2025):

@Mustardsauce please share the log from startup to the point it reports `inference compute` so we can see what's going wrong.

@deparko your logs don't have the debug setting set properly - they're still logging only INFO log messages, not TRACE log messages. The simplest way to get this would be something like:

```
sudo systemctl stop ollama
OLLAMA_DEBUG=2 ollama serve 2>&1 | tee serve.log
```

Then just hit `^C` as soon as it reports `inference compute` and share the serve.log


@deparko commented on GitHub (Nov 22, 2025):

please see attached!

[ollama_serve_debug2_startup.log](https://github.com/user-attachments/files/23686710/ollama_serve_debug2_startup.log)


@deparko commented on GitHub (Nov 27, 2025):

Any updates? I'm basically dead in the water.


@rick-github commented on GitHub (Nov 27, 2025):

There's no attempt to load a CUDA backend, perhaps because there's some confusion about where the backends are:

```
time=2025-11-21T20:07:02.536-08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
time=2025-11-21T20:07:02.536-08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama/ollama
```

How did you install ollama? What's the output of the following:

```
ls -lR /usr/local/lib/ollama
systemctl cat ollama
for i in $(pidof ollama) ; do echo $i ; sudo cat /proc/$i/environ | tr \\0 \\n ; done
```

@deparko commented on GitHub (Nov 27, 2025):

Reply to rick-github on Issue #13163

Solution Found! 🎉

Thanks @rick-github for the diagnostic guidance! I found the issue.

The Problem

Ollama 0.13.0 doesn't search subdirectories for CUDA libraries. The tarball installation puts CUDA libs in:

- `/usr/local/lib/ollama/ollama/cuda_v13/libggml-cuda.so`

But Ollama only searches:

- `/usr/local/lib/ollama/`
- `/usr/local/lib/ollama/ollama/`

It does NOT recursively search the `cuda_v13/` or `cuda_v12/` subdirectories.
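
A quick way to confirm where the backend libraries actually landed, using only standard tools (the path below matches the tarball layout described here; adjust it if yours differs), is something like:

```bash
# list every GGML backend library under the install prefix, wherever it sits
find /usr/local/lib/ollama -name 'libggml-*.so*' 2>/dev/null
```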

Debug Log Evidence

With `OLLAMA_DEBUG=1`, the log showed:

```
OLLAMA_LIBRARY_PATH="[/usr/local/lib/ollama /usr/local/lib/ollama/ollama]"
```

This confirms Ollama only looks in these two directories, not the `cuda_v13/` subdirectory where `libggml-cuda.so` actually lives.

The Fix

Create symlinks to put the CUDA library where Ollama looks:

```bash
sudo ln -sf /usr/local/lib/ollama/ollama/cuda_v13/libggml-cuda.so /usr/local/lib/ollama/ollama/libggml-cuda.so
sudo ln -sf /usr/local/lib/ollama/ollama/cuda_v13/libcudart.so.13 /usr/local/lib/ollama/ollama/libcudart.so.13
sudo ln -sf /usr/local/lib/ollama/ollama/cuda_v13/libcublas.so.13 /usr/local/lib/ollama/ollama/libcublas.so.13
sudo ln -sf /usr/local/lib/ollama/ollama/cuda_v13/libcublasLt.so.13 /usr/local/lib/ollama/ollama/libcublasLt.so.13
```
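
To confirm the links took effect, a minimal check after restarting the service (unit name assumed to be `ollama.service`, model name is just an example) might look like:

```bash
sudo systemctl restart ollama.service
# the startup log should now report a CUDA device under "inference compute"
journalctl -u ollama.service --since "2 minutes ago" | grep -i "inference compute"
# load a model, then check how much of it was offloaded to the GPU
ollama run llama3.1:8b "hello"
ollama ps
```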

Result

After creating the symlinks and restarting:

```
msg="inference compute" id=GPU-a455be12-220b-715b-6c30-bad6fc091546 library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5070 Ti" total="15.9 GiB" available="11.8 GiB"
```

```
NAME           ID              SIZE      PROCESSOR    CONTEXT    UNTIL
llama3.1:8b    46e0c10c039e    5.5 GB    100% GPU     4096       4 minutes from now
```

100% GPU acceleration working! 🎉

Is This a Bug?

I believe so. The tarball installation creates this structure:

```
/usr/local/lib/ollama/ollama/
├── libggml-cpu-*.so (CPU backends - in parent dir)
├── cuda_v12/
│   └── libggml-cuda.so (CUDA 12 - in subdirectory)
├── cuda_v13/
│   └── libggml-cuda.so (CUDA 13 - in subdirectory)
└── vulkan/
    └── libggml-vulkan.so (Vulkan - in subdirectory)
```

But the library discovery code only searches the parent directory, not subdirectories. Either:

  1. The discovery code should recursively search subdirectories, OR
  2. The tarball should install `libggml-cuda.so` in the parent directory (perhaps with symlinks to the versioned subdirectories)

Environment

- Ollama 0.13.0 (tarball installation)
- Ubuntu 25.04
- NVIDIA RTX 5070 Ti (Compute Capability 12.0, Blackwell)
- Driver 580.95.05 (CUDA 13.0)

Thanks again for your help!


@rick-github commented on GitHub (Nov 27, 2025):

> I believe so. The tarball installation creates this structure:

This is why I asked for what installation method you used. A [manual install](https://github.com/ollama/ollama/blob/main/docs/linux.mdx#manual-install) will put the backends in /usr/lib/ollama, the [recommended install](https://ollama.com/download/linux) will put the backends in /usr/local/lib/ollama. No install method should put the backends in /usr/local/lib/ollama/ollama.
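
If the layout has ended up nested like that, one way back to the expected layout is to re-run the standard Linux install script and then sanity-check the library directory; a rough sketch (the expected contents are an assumption based on the layout described in this thread):

```bash
# re-run the standard Linux installer
curl -fsSL https://ollama.com/install.sh | sh
# the backends should now sit directly under the library dir
# (e.g. cuda_v12/, cuda_v13/, with no nested ollama/ directory)
ls /usr/local/lib/ollama
```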


@deparko commented on GitHub (Nov 27, 2025):

Thank you and Happy Thanksgiving!🦃

Reference: github-starred/ollama#8704