[GH-ISSUE #9517] Ollama detects the GPU but still uses the CPU. #68261

Closed
opened 2026-05-04 13:02:56 -05:00 by GiteaMirror · 27 comments
Owner

Originally created by @RadEdje on GitHub (Mar 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9517

What is the issue?

This seems to be happening with all of my LLMs.

Ollama's logs show that the GPU is detected (they name my RTX 4070 SUPER), and `ollama ps` reports roughly 15% on CPU and 85% on GPU. But when I watch my system tray, my GPU VRAM usage doesn't budge while my CPU RAM fills up, and all of my LLMs run slowly. I think this could be an NVIDIA driver update issue.

Based on this thread, this has happened before:
https://github.com/ollama/ollama/issues/4563

I remember updating my NVIDIA drivers recently. Could the driver update be the cause?
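
To double-check what the server actually planned, I pulled the layer-offload numbers out of the server log. A rough sketch (the sample line below is pasted from my log; swap in something like `grep 'msg=offload' server.log | tail -n 1` to read a live log instead):

```shell
# Rough sketch: extract the layer-offload plan from an Ollama server log line.
# LOG_LINE here is an inline sample; replace it with a line from your own log.
LOG_LINE='level=INFO source=server.go:130 msg=offload library=cuda layers.model=41 layers.offload=31'
model=$(echo "$LOG_LINE" | grep -o 'layers\.model=[0-9]*' | cut -d= -f2)
offload=$(echo "$LOG_LINE" | grep -o 'layers\.offload=[0-9]*' | cut -d= -f2)
echo "$offload of $model layers planned for GPU"
```

In my case this reports 31 of 41 layers planned for the GPU, which matches what `ollama ps` shows, even though VRAM usage never moves.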

[Four screenshot attachments omitted]

Here are the server logs:

2025/03/05 19:04:49 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-03-05T19:04:49.499+08:00 level=INFO source=images.go:432 msg="total blobs: 49"
time=2025-03-05T19:04:49.499+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-05T19:04:49.502+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)"
time=2025-03-05T19:04:49.502+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-05T19:04:49.502+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-05T19:04:49.502+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-03-05T19:04:50.058+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB"
time=2025-03-05T19:04:50.062+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/03/05 - 19:56:12 | 200 |      2.5993ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/03/05 - 19:56:12 | 200 |     13.6765ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     15.7392ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     23.1269ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     23.6556ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     22.5294ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     24.6891ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     25.2033ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |      26.222ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     26.6927ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     30.3713ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |      1.7979ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/03/05 - 19:56:12 | 200 |     10.4684ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |      16.756ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     18.1754ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     19.7211ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     19.7211ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     20.2351ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     20.8763ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     23.4699ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     22.2817ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/05 - 19:56:12 | 200 |     30.5956ms |       127.0.0.1 | POST     "/api/show"
[... ~200 similar GET "/api/tags" and POST "/api/show" requests between 19:56:38 and 19:59:42 trimmed ...]
[GIN] 2025/03/05 - 20:07:37 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/05 - 20:07:37 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/03/05 - 20:07:57 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/05 - 20:07:57 | 200 |      1.5578ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/03/05 - 20:08:37 | 200 |       518.7µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/05 - 20:08:37 | 200 |     13.9912ms |       127.0.0.1 | POST     "/api/show"
time=2025-03-05T20:08:37.685+08:00 level=WARN source=sched.go:138 msg="mllama doesn't support parallel requests yet"
time=2025-03-05T20:08:37.731+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-05T20:08:37.731+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-05T20:08:37.733+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-05T20:08:37.733+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-05T20:08:37.755+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="23.9 GiB" free_swap="39.4 GiB"
time=2025-03-05T20:08:37.756+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-05T20:08:37.757+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-05T20:08:37.757+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=31 layers.split="" memory.available="[10.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.1 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.1 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB"
time=2025-03-05T20:08:37.763+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 31 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 61901"
time=2025-03-05T20:08:37.765+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-05T20:08:37.765+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-05T20:08:37.765+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-05T20:08:37.783+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-05T20:08:37.823+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-05T20:08:37.824+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:61901"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW) 
load: special tokens cache size = 257
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  5679.33 MiB
time=2025-03-05T20:08:38.017+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: mllama_model_load: using CPU backend

mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-03-05T20:08:43.775+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.01 seconds"
[GIN] 2025/03/05 - 20:08:43 | 200 |    6.1022747s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/05 - 20:09:41 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/05 - 20:09:41 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"

Any advice would be much appreciated. Is it an NVIDIA driver issue? llama3.2-v would run really fast before; now it's slow. I wondered why, and then I saw it was stuck in CPU RAM.
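As a quick sanity check (a sketch, not an official Ollama tool), the server log itself shows where the weights landed: a CPU-only load prints `CPU model buffer size` lines, while a successful GPU offload would print CUDA buffer lines instead. Filtering a saved copy of the log makes this easy to spot; the sample lines below are copied from the log in this issue, and the file name is just an illustration:

```shell
# Sketch: save the relevant server log lines to a file, then count
# how many model/KV buffers were placed on the CPU backend.
# (A healthy GPU load would show "CUDA0 ... buffer size" lines instead.)
cat > server_sample.log <<'EOF'
load_tensors:          CPU model buffer size =  5679.33 MiB
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
EOF

grep -c 'CPU .*buffer size' server_sample.log
```

If that count is nonzero and no CUDA buffer lines appear at all, the model is running entirely on the CPU regardless of what the scheduler reported.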

Relevant log output


OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

ollama version 0.5.13

Originally created by @RadEdje on GitHub (Mar 5, 2025). Original GitHub issue: https://github.com/ollama/ollama/issues/9517 ### What is the issue? This seems to be happening for all my LLMs.... ollama logs that the GPU is detected (it names my rtx 4070 super). using Ollama ps.... it says 15% on cpu and 85% on gpu.... but if I look at my system tray.... my GPU vram usage dosen't budge but my CPU RAM gets full. all my LLM's also run slow... i think this could be an nvidia driver update issue? based on this thread: https://github.com/ollama/ollama/issues/4563 this has happened before. I remember updating nvidia drivers recently. Could it be the nvidia driver update? ![Image](https://github.com/user-attachments/assets/b9ba2a32-5034-42bc-967a-a0a0194cab96) ![Image](https://github.com/user-attachments/assets/3950e975-06d4-4b8f-97ab-791da937fd51) ![Image](https://github.com/user-attachments/assets/45cb3009-62ed-4742-8f26-3c643507b966) ![Image](https://github.com/user-attachments/assets/904b8199-ff82-476a-9419-e8122882a0c8) here are the server logs: ``` 2025/03/05 19:04:49 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] 
OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" time=2025-03-05T19:04:49.499+08:00 level=INFO source=images.go:432 msg="total blobs: 49" time=2025-03-05T19:04:49.499+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0" time=2025-03-05T19:04:49.502+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)" time=2025-03-05T19:04:49.502+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs" time=2025-03-05T19:04:49.502+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1 time=2025-03-05T19:04:49.502+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16 time=2025-03-05T19:04:50.058+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB" time=2025-03-05T19:04:50.062+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB" [GIN] 2025/03/05 - 19:56:12 | 200 | 2.5993ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:56:12 | 200 | 13.6765ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 15.7392ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 23.1269ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 23.6556ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 22.5294ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 24.6891ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 25.2033ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 26.222ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 26.6927ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 30.3713ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 1.7979ms | 127.0.0.1 | GET "/api/tags" [GIN] 
2025/03/05 - 19:56:12 | 200 | 10.4684ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 16.756ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 18.1754ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 19.7211ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 19.7211ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 20.2351ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 20.8763ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 23.4699ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 22.2817ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:12 | 200 | 30.5956ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 1.6281ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:56:38 | 200 | 12.5507ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 14.7088ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 20.952ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 21.4598ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 20.9034ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 23.018ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 20.398ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 21.4307ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 24.0046ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 28.6934ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 1.0192ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:56:38 | 200 | 11.5496ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 13.7375ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 18.3825ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 19.9869ms | 127.0.0.1 | POST "/api/show" [GIN] 
2025/03/05 - 19:56:38 | 200 | 19.9869ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 20.4877ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 21.5134ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 22.5542ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 23.5897ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:56:38 | 200 | 29.7682ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 1.8119ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:58:51 | 200 | 12.515ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 13.5402ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 17.6395ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 18.6716ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 19.178ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 19.6973ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 23.1397ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 23.1349ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 22.6099ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 29.439ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 1.5357ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:58:51 | 200 | 11.9747ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 13.535ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 19.2585ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 20.311ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 20.3369ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 21.4318ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 21.9554ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 22.4728ms | 127.0.0.1 | POST "/api/show" [GIN] 
2025/03/05 - 19:58:51 | 200 | 22.4939ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:58:51 | 200 | 28.3127ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 3.6976ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:03 | 200 | 9.3186ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 15.7435ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 16.2515ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 19.9332ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 19.3712ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 19.8971ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 19.3712ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 20.4237ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 20.952ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:03 | 200 | 26.7192ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 1.6096ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:04 | 200 | 12.0002ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 14.5905ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 15.6204ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 18.2716ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 20.3691ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 21.9886ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 22.9654ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 24.577ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 23.9937ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:04 | 200 | 27.6369ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 1.5394ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:18 | 200 | 12.7631ms | 127.0.0.1 | POST "/api/show" [GIN] 
2025/03/05 - 19:59:18 | 200 | 18.4773ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 19.0014ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 20.5625ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 21.0835ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 21.0886ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 22.129ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 25.8145ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 26.8507ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 31.634ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 2.0565ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:18 | 200 | 10.3752ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 12.9453ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 19.6498ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 20.226ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 20.226ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 20.746ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 20.746ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 22.2997ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 21.7861ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:18 | 200 | 26.4108ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 2.0712ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:22 | 200 | 12.251ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 16.7867ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 24.1263ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 26.3154ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 28.3696ms | 127.0.0.1 | POST "/api/show" [GIN] 
2025/03/05 - 19:59:22 | 200 | 28.3696ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 29.4133ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 30.4816ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 32.0335ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 36.1235ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 1.5437ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:22 | 200 | 14.7021ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 14.7021ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 18.1082ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 19.8781ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 19.6526ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 20.7403ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 22.2774ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 24.5602ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 25.5878ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:22 | 200 | 26.605ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 2.0532ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:26 | 200 | 10.9356ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 18.2135ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 18.744ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 19.7615ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 21.3087ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 20.8059ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 21.3264ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 22.8596ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 22.3464ms | 127.0.0.1 | POST "/api/show" [GIN] 
2025/03/05 - 19:59:26 | 200 | 29.1783ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 1.6341ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:26 | 200 | 9.6007ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 13.6566ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 16.7855ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 17.2704ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 18.2987ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 18.8119ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 19.8395ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 20.9023ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 20.3846ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:26 | 200 | 27.5717ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 1.021ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:30 | 200 | 11.5011ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 14.1528ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 17.2303ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 19.2865ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 19.2865ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 20.324ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 19.8027ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 21.8605ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 24.4758ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:30 | 200 | 26.5235ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:31 | 200 | 1.5453ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:31 | 200 | 11.9742ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:31 | 200 | 16.1187ms | 127.0.0.1 | POST "/api/show" [GIN] 
2025/03/05 - 19:59:31 | 200 | 18.7104ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:31 | 200 | 18.1897ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:31 | 200 | 19.808ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:31 | 200 | 21.3389ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:31 | 200 | 21.8596ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:31 | 200 | 22.3709ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:31 | 200 | 23.3809ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:31 | 200 | 26.9695ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 1.5595ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:37 | 200 | 11.9002ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 12.9594ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 17.7668ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 18.2857ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 19.875ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 21.406ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 21.406ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 21.9511ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 21.9189ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 25.5304ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 1.5524ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:37 | 200 | 11.7524ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 12.736ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 18.4464ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 19.5686ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 18.4774ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 20.6326ms | 127.0.0.1 | POST "/api/show" [GIN] 
2025/03/05 - 19:59:37 | 200 | 21.1495ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 23.1232ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 25.7691ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:37 | 200 | 27.2605ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 1.561ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:41 | 200 | 11.5824ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 15.7844ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 18.3611ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 18.1132ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 18.6336ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 19.6023ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 22.9813ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 22.9813ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 22.6866ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:41 | 200 | 28.7407ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 1.5334ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 19:59:42 | 200 | 14.7257ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 16.7961ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 17.6276ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 19.3817ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 20.4289ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 20.2214ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 22.5134ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 23.3435ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 21.2694ms | 127.0.0.1 | POST "/api/show" [GIN] 2025/03/05 - 19:59:42 | 200 | 27.6547ms | 127.0.0.1 | POST "/api/show" [GIN] 
2025/03/05 - 20:07:37 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/03/05 - 20:07:37 | 200 | 0s | 127.0.0.1 | GET "/api/ps" [GIN] 2025/03/05 - 20:07:57 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/03/05 - 20:07:57 | 200 | 1.5578ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/05 - 20:08:37 | 200 | 518.7µs | 127.0.0.1 | HEAD "/" [GIN] 2025/03/05 - 20:08:37 | 200 | 13.9912ms | 127.0.0.1 | POST "/api/show" time=2025-03-05T20:08:37.685+08:00 level=WARN source=sched.go:138 msg="mllama doesn't support parallel requests yet" time=2025-03-05T20:08:37.731+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128 time=2025-03-05T20:08:37.731+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128 time=2025-03-05T20:08:37.733+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128 time=2025-03-05T20:08:37.733+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128 time=2025-03-05T20:08:37.755+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="23.9 GiB" free_swap="39.4 GiB" time=2025-03-05T20:08:37.756+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128 time=2025-03-05T20:08:37.757+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128 time=2025-03-05T20:08:37.757+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=31 layers.split="" memory.available="[10.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.1 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.1 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" 
projector.graph="2.8 GiB" time=2025-03-05T20:08:37.763+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 31 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 61901" time=2025-03-05T20:08:37.765+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2025-03-05T20:08:37.765+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding" time=2025-03-05T20:08:37.765+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error" time=2025-03-05T20:08:37.783+08:00 level=INFO source=runner.go:931 msg="starting go runner" ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll time=2025-03-05T20:08:37.823+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8 time=2025-03-05T20:08:37.824+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:61901" llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata 
keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = mllama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Model llama_model_loader: - kv 3: general.size_label str = 10B llama_model_loader: - kv 4: mllama.block_count u32 = 40 llama_model_loader: - kv 5: mllama.context_length u32 = 131072 llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096 llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336 llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32 llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 12: general.file_type u32 = 15 llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256 llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128 llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38] llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004 llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ... 
llama_model_loader: - kv 26: general.quantization_version u32 = 2 llama_model_loader: - type f32: 114 tensors llama_model_loader: - type q4_K: 245 tensors llama_model_loader: - type q6_K: 37 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 5.55 GiB (4.87 BPW) load: special tokens cache size = 257 load: token to piece cache size = 0.7999 MB print_info: arch = mllama print_info: vocab_only = 0 print_info: n_ctx_train = 131072 print_info: n_embd = 4096 print_info: n_layer = 40 print_info: n_head = 32 print_info: n_head_kv = 8 print_info: n_rot = 128 print_info: n_swa = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 4 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-05 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: n_ff = 14336 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 0 print_info: rope scaling = linear print_info: freq_base_train = 500000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 131072 print_info: rope_finetuned = unknown print_info: ssm_d_conv = 0 print_info: ssm_d_inner = 0 print_info: ssm_d_state = 0 print_info: ssm_dt_rank = 0 print_info: ssm_dt_b_c_rms = 0 print_info: model type = 11B print_info: model params = 9.78 B print_info: general.name = Model print_info: vocab type = BPE print_info: n_vocab = 128257 print_info: n_merges = 280147 print_info: BOS token = 128000 '<|begin_of_text|>' print_info: EOS token = 128009 '<|eot_id|>' print_info: EOT token = 128009 '<|eot_id|>' print_info: EOM token = 128008 '<|eom_id|>' print_info: PAD token = 128004 '<|finetune_right_pad_id|>' print_info: LF token = 198 'Ċ' print_info: EOG token = 128008 '<|eom_id|>' print_info: EOG token = 128009 '<|eot_id|>' 
print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) load_tensors: CPU model buffer size = 5679.33 MiB time=2025-03-05T20:08:38.017+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model" llama_init_from_model: n_seq_max = 1 llama_init_from_model: n_ctx = 2048 llama_init_from_model: n_ctx_per_seq = 2048 llama_init_from_model: n_batch = 512 llama_init_from_model: n_ubatch = 512 llama_init_from_model: flash_attn = 0 llama_init_from_model: freq_base = 500000.0 llama_init_from_model: freq_scale = 1 llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1 llama_kv_cache_init: CPU KV buffer size = 656.25 MiB llama_init_from_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB llama_init_from_model: CPU output buffer size = 0.50 MiB llama_init_from_model: CPU compute buffer size = 258.50 MiB llama_init_from_model: graph nodes = 1030 llama_init_from_model: graph splits = 1 mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct mllama_model_load: description: vision encoder for Mllama mllama_model_load: GGUF version: 3 mllama_model_load: alignment: 32 mllama_model_load: n_tensors: 512 mllama_model_load: n_kv: 17 mllama_model_load: ftype: f16 mllama_model_load: mllama_model_load: mllama_model_load: using CPU backend mllama_model_load: compute allocated memory: 2853.34 MB time=2025-03-05T20:08:43.775+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.01 seconds" [GIN] 2025/03/05 - 20:08:43 | 200 | 6.1022747s | 127.0.0.1 | POST "/api/generate" [GIN] 2025/03/05 - 20:09:41 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/03/05 - 20:09:41 | 200 | 0s | 127.0.0.1 | GET "/api/ps" ``` Any advice would be much appreciated? 
Is it an nvidia driver issue? llama3.2-vision would run really fast before... now it's slow... I wondered why, and then I saw it was stuck in CPU RAM....

### Relevant log output

```shell
```

### OS

Windows

### GPU

Nvidia

### CPU

AMD

### Ollama version

ollama version 0.5.13
GiteaMirror added the bug label 2026-05-04 13:02:56 -05:00

@rick-github commented on GitHub (Mar 5, 2025):

time=2025-03-05T20:08:37.783+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-05T20:08:37.823+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8

Apparently no GPU backends. What does the following show:

dir /s C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama

@RadEdje commented on GitHub (Mar 5, 2025):

time=2025-03-05T20:08:37.783+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-05T20:08:37.823+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8

Apparently no GPU backends. What does the following show:

dir /s C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama

this is what it shows.....

Image

Image


@RadEdje commented on GitHub (Mar 5, 2025):

the cuda folders both contain files... they're not empty.

I already uninstalled and then re-installed ollama.

To test, I tried LM studio.... and LM studio uses the GPU.... but I feel Ollama is just better.


@NGC13009 commented on GitHub (Mar 6, 2025):

I ran into the same problem on Linux, and reinstalling ollama fixed it.


@RadEdje commented on GitHub (Mar 6, 2025):

I ran into the same problem on Linux, and reinstalling ollama fixed it.

Hello.... I tried uninstalling and reinstalling.... but same issue.... I think I need to purge something that ollama might be leaving behind that affects the reinstall?


@NGC13009 commented on GitHub (Mar 6, 2025):

I ran into the same problem on Linux, and reinstalling ollama fixed it.

Hello.... I tried uninstalling and reinstalling....but same issue.... I think I need to purge something that ollama might be leaving behind.... ? That affects the re install?

You can try the steps I did. After reading install.sh and following the installation steps, I did the following to uninstall:

  1. sudo systemctl stop ollama
  2. sudo systemctl disable ollama
  3. sudo rm /usr/local/bin/ollama
  4. back up the config: cp /etc/systemd/system/ollama.service ~/backup/ (it will be removed on re-install, if you need it)
  5. sudo rm -rf /etc/ollama

What I did NOT do:

  1. remove the ollama user/group
  2. remove the config in /etc/systemd/system/ollama.service
  3. remove the models that were in use

In addition, my CUDA environment (including cuDNN, cuBLAS, etc.) was not reinstalled; based on the prompts during reinstallation, install.sh automatically skipped installing the CUDA components, because I had installed them myself before installing ollama. If your CUDA environment was installed automatically by ollama, you may still need to find the dependencies that were installed and reinstall them based on the contents of install.sh, I guess.
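For anyone who wants to script the Linux cleanup steps above, here is a minimal dry-run sketch (the `run` wrapper and the `~/backup` destination are illustrative, not part of the official uninstall procedure; it only prints the plan until you remove the `echo`):

```shell
#!/usr/bin/env bash
# Dry-run sketch of the uninstall steps above; swap the 'echo' inside
# run() for real execution once the printed plan looks right.
set -euo pipefail
run() { echo "+ $*"; }

run sudo systemctl stop ollama
run sudo systemctl disable ollama
run cp /etc/systemd/system/ollama.service ~/backup/   # keep the old unit file
run sudo rm /usr/local/bin/ollama
run sudo rm -rf /etc/ollama
```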


@RadEdje commented on GitHub (Mar 6, 2025):

I ran into the same problem on Linux, and reinstalling ollama fixed it.

Hello.... I tried uninstalling and reinstalling....but same issue.... I think I need to purge something that ollama might be leaving behind.... ? That affects the re install?

You can try the steps I did. After I read the install.sh, followed the installation steps, I did the following to uninstall:

  1. sudo systemctl stop ollama
  2. sudo systemctl disable ollama
  3. sudo rm /usr/local/bin/ollama
  4. backup the config: cp /etc/systemd/system/ollama.service ~/backup/ which will be remove in re-install, if it is needed.
  5. sudo rm -rf /etc/ollama

what i did NOT do:

  1. remove the ollama user/group
  2. remove the config in /etc/systemd/system/ollama.service.
  3. remove the model which was used

In addition, my CUDA environments (including cuDNN and cuBLAS, etc.) were not reinstalled, and based on the prompts during reinstallation, install.sh automatically skipped installing the cuda environments. This is due to the fact that the relevant environments were installed by me before installing ollama. If your cuda environments were installed automatically by ollama, you may still need to find the dependencies that were once installed and reinstall them based on the contents of install.sh. i guess.

i tried uninstalling and reinstalling.... same problem.....

But I just read your replay... will try these. thanks... Unfortunately I'm on windows.... not linux.....but I'll see if I can replicate what you did. gotta sleep... work tomorrow.... but i will update here what happens.... when I try your suggestion. Thanks again!


@RadEdje commented on GitHub (Mar 8, 2025):

I ran into the same problem on Linux, and reinstalling ollama fixed it.

Hello.... I tried uninstalling and reinstalling....but same issue.... I think I need to purge something that ollama might be leaving behind.... ? That affects the re install?

You can try the steps I did. After I read the install.sh, followed the installation steps, I did the following to uninstall:

  1. sudo systemctl stop ollama
  2. sudo systemctl disable ollama
  3. sudo rm /usr/local/bin/ollama
  4. backup the config: cp /etc/systemd/system/ollama.service ~/backup/ which will be remove in re-install, if it is needed.
  5. sudo rm -rf /etc/ollama

what i did NOT do:

  1. remove the ollama user/group
  2. remove the config in /etc/systemd/system/ollama.service.
  3. remove the model which was used

In addition, my CUDA environments (including cuDNN and cuBLAS, etc.) were not reinstalled, and based on the prompts during reinstallation, install.sh automatically skipped installing the cuda environments. This is due to the fact that the relevant environments were installed by me before installing ollama. If your cuda environments were installed automatically by ollama, you may still need to find the dependencies that were once installed and reinstall them based on the contents of install.sh. i guess.

i tried, but I couldn't figure out which folders to remove on the Windows side of things. Would you have some idea what the Windows equivalents of your Linux folders are? Aside from the folder automatically removed during uninstall, all I removed was .ollama in the Windows user folder..... this is really sad..... One thing I noticed: if I use a really REALLY small 2 GB model, it seems to run on the GPU.... but when I use the bigger models like deepseek 14b or llama3.2-vision, which used to fit on my GPU, ollama now goes straight to full CPU and doesn't even touch my GPU before offloading to CPU RAM.... so 2 GB models seem to run on the GPU.... could there be a setting I need to fix?


@NGC13009 commented on GitHub (Mar 8, 2025):

I ran into the same problem on Linux, and reinstalling ollama fixed it.

Hello.... I tried uninstalling and reinstalling....but same issue.... I think I need to purge something that ollama might be leaving behind.... ? That affects the re install?

You can try the steps I did. After I read the install.sh, followed the installation steps, I did the following to uninstall:

  1. sudo systemctl stop ollama
  2. sudo systemctl disable ollama
  3. sudo rm /usr/local/bin/ollama
  4. backup the config: cp /etc/systemd/system/ollama.service ~/backup/ which will be remove in re-install, if it is needed.
  5. sudo rm -rf /etc/ollama

what i did NOT do:

  1. remove the ollama user/group
  2. remove the config in /etc/systemd/system/ollama.service.
  3. remove the model which was used

In addition, my CUDA environments (including cuDNN and cuBLAS, etc.) were not reinstalled, and based on the prompts during reinstallation, install.sh automatically skipped installing the cuda environments. This is due to the fact that the relevant environments were installed by me before installing ollama. If your cuda environments were installed automatically by ollama, you may still need to find the dependencies that were once installed and reinstall them based on the contents of install.sh. i guess.

i tried but I couldn't figure out what folders to remove on the windows side of things. Would you have some idea what your linux folders are on the windows side? all i removed was .ollama in the windows user folder aside from the folder automatically removed during uninstall..... this is really sad..... One think I noticed.... if I use a really REALLY small 2gig model..... it seems to run on the GPU.... but when I use the bigger models like deepseek 14b or the llama3.2-vision which used to fit on my GPU.... now ollama goes straight to full CPU and doesn't even touch my GPU before offloading to the CPU ram.... so 2 gig models seem to run on the GPU.... could there be a setting I need to fix?

How much VRAM does your GPU have? Some models support simultaneous CPU/GPU inference, so their layers can be loaded onto different devices, but other models must complete inference on a single device and cannot be split across heterogeneous devices.

So it may be that VRAM is insufficient and this model happens not to support split inference, in which case it all gets loaded onto the CPU.

You can refer to this issue: #8509

Use these commands to force the model to compute on the GPU:

# model e.g. qwen2.5-coder:latest
PS > ollama show --modelfile qwen2.5-coder:latest > Modelfile
PS > echo "PARAMETER num_gpu 999" >> Modelfile   # total number of layers offloaded to the GPU
PS > ollama create qwen2.5-coder:gpu -f Modelfile

Although ollama ps will still appear to show the model running on the CPU, at that point the model is actually computed by the GPU. On Windows the GPU will fall back to shared (system) memory when physical VRAM is insufficient.

But shared memory is very slow, possibly slower than CPU inference.

VRAM consumed by a model = the model's own weights + the size needed by the KV cache. When the ctx size is set fairly large, the KV cache eats a lot of VRAM. If the model itself is smaller than the card's physical VRAM, try shortening the context.
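As a rough sanity check of that formula, the f16 self-attention KV cache implied by the numbers in the server log earlier in the thread (n_layer = 40, kv_size = 2048, n_embd_k_gqa = n_embd_v_gqa = 1024, 2 bytes per f16 element) can be computed directly; the mllama cross-attention cache adds more on top (the log reports 656.25 MiB total), so treat this as a lower bound:

```shell
# f16 KV cache lower bound = 2 slabs (K and V) * layers * ctx * per-layer width * 2 bytes
n_layer=40; n_ctx=2048; n_embd_gqa=1024; bytes_per_elem=2
kv_bytes=$(( 2 * n_layer * n_ctx * n_embd_gqa * bytes_per_elem ))
echo "$(( kv_bytes / 1024 / 1024 )) MiB"   # prints: 320 MiB
```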

Also, ollama's default parallelism is 3, which makes the effective ctx size a multiple of the configured parallel count. Try setting a Windows system environment variable to reduce the parallelism to 1:

OLLAMA_NUM_PARALLEL      1

Image
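On Windows the variable is usually set through the System Properties dialog as in the screenshot, or with `setx OLLAMA_NUM_PARALLEL 1` in a terminal, followed by a restart of Ollama. On Linux/macOS the equivalent for the current shell session is:

```shell
# Limit Ollama to a single parallel request slot, so the KV cache is
# allocated for one context rather than several; restart the server
# afterwards so it picks the value up.
export OLLAMA_NUM_PARALLEL=1
echo "$OLLAMA_NUM_PARALLEL"
```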


@RadEdje commented on GitHub (Mar 8, 2025):

Although ollama ps will still appear to show the model running on the CPU, at that point the model is actually computed by the GPU. On Windows the GPU will fall back to shared (system) memory when physical VRAM is insufficient.

But shared memory is very slow, possibly slower than CPU inference.

VRAM consumed by a model = the model's own weights + the size needed by the KV cache. When the ctx size is set fairly large, the KV cache eats a lot of VRAM. If the model itself is smaller than the card's physical VRAM, try shortening the context.

Also, ollama's default parallelism is 3, which makes the effective ctx size a multiple of the configured parallel count. Try setting a Windows system environment variable to reduce the parallelism to 1?

Thank you for your help and quick replies.
I did as you suggested....
changed the environment variable to 1.

Created a new model called llama3.2-vision:gpu and still it runs in CPU RAM... This is weird since these all used to run on my GPU. I had deepseek-r1 running on top of llama3.2-vision on OpenWebUI.... and it all fit into my GPU.... now even llama3.2-v won't work.... even the tiny Granite vision model doesn't run on my GPU.... this is really sad.... btw I also installed the CUDA toolkit.... to update my CUDA after my nvidia drivers... Still nothing happened....

I'm using an RTX 4070 super with 12 gigs of vram.....

thank you again for any advice...
at this rate I will have to remove ollama and rely mainly on LM Studio, which is sad since LM Studio still doesn't support mllama for vision models.... so llama3.2-Vision doesn't work on LM Studio.... and if it's this slow, running llama vision in Ollama CPU RAM is also pointless.... it's basically unusable now locally....

still

thank you for all your help.


@rick-github commented on GitHub (Mar 8, 2025):

Server log will show why GPU is not used.


@RadEdje commented on GitHub (Mar 9, 2025):

Server log will show why GPU is not used.

oh sorry.... will put server logs here....

initial start of ollama from the terminal as of today.

2025/03/09 16:19:33 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-03-09T16:19:33.955+08:00 level=INFO source=images.go:432 msg="total blobs: 51"
time=2025-03-09T16:19:33.956+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-09T16:19:33.961+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-03-09T16:19:34.076+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" overhead="91.5 MiB"
time=2025-03-09T16:19:34.486+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB"
time=2025-03-09T16:19:34.489+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/03/09 - 16:20:04 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:20:04 | 200 |       38.42ms |       127.0.0.1 | GET      "/api/tags"


after running ollama run llama3.2-vision

Image

Image

ollama ps notes that the GPU is utilized way more than the CPU, but this is not reflected in the system tray.

Image

updated server logs:

2025/03/09 16:19:33 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-03-09T16:19:33.955+08:00 level=INFO source=images.go:432 msg="total blobs: 51"
time=2025-03-09T16:19:33.956+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-09T16:19:33.961+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-03-09T16:19:34.076+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" overhead="91.5 MiB"
time=2025-03-09T16:19:34.486+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB"
time=2025-03-09T16:19:34.489+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/03/09 - 16:20:04 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:20:04 | 200 |       38.42ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/03/09 - 16:24:52 | 200 |       2.035ms |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:24:52 | 200 |     40.7808ms |       127.0.0.1 | POST     "/api/show"
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.765+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="22.9 GiB" free_swap="37.3 GiB"
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.766+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=32 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.2 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.2 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB"
time=2025-03-09T16:24:52.771+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 32 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 50660"
time=2025-03-09T16:24:52.774+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T16:24:52.774+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T16:24:52.775+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T16:24:52.792+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-09T16:25:02.405+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T16:25:02.406+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:50660"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW) 
load: special tokens cache size = 257
time=2025-03-09T16:25:02.538+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  5679.33 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: mllama_model_load: using CPU backend

mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-03-09T16:25:07.544+08:00 level=INFO source=server.go:596 msg="llama runner started in 14.77 seconds"
[GIN] 2025/03/09 - 16:25:07 | 200 |    14.857539s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/09 - 16:25:26 | 200 |    7.7269084s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/09 - 16:26:35 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:26:35 | 200 |       521.9µs |       127.0.0.1 | GET      "/api/ps"

Unloading the model with `ollama stop` shows a drastic decrease in CPU RAM usage.

Image

Here is the updated server log after running `ollama stop`:

2025/03/09 16:19:33 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-03-09T16:19:33.955+08:00 level=INFO source=images.go:432 msg="total blobs: 51"
time=2025-03-09T16:19:33.956+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-09T16:19:33.961+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-03-09T16:19:34.076+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" overhead="91.5 MiB"
time=2025-03-09T16:19:34.486+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB"
time=2025-03-09T16:19:34.489+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/03/09 - 16:20:04 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:20:04 | 200 |       38.42ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/03/09 - 16:24:52 | 200 |       2.035ms |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:24:52 | 200 |     40.7808ms |       127.0.0.1 | POST     "/api/show"
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.765+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="22.9 GiB" free_swap="37.3 GiB"
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.766+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=32 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.2 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.2 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB"
time=2025-03-09T16:24:52.771+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 32 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 50660"
time=2025-03-09T16:24:52.774+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T16:24:52.774+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T16:24:52.775+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T16:24:52.792+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-09T16:25:02.405+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T16:25:02.406+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:50660"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW) 
load: special tokens cache size = 257
time=2025-03-09T16:25:02.538+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  5679.33 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: mllama_model_load: using CPU backend

mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-03-09T16:25:07.544+08:00 level=INFO source=server.go:596 msg="llama runner started in 14.77 seconds"
[GIN] 2025/03/09 - 16:25:07 | 200 |    14.857539s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/09 - 16:25:26 | 200 |    7.7269084s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/09 - 16:26:35 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:26:35 | 200 |       521.9µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/03/09 - 16:29:46 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:29:46 | 200 |     45.4588ms |       127.0.0.1 | POST     "/api/generate"
time=2025-03-09T16:29:51.799+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0243079 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-03-09T16:29:52.048+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2738438 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-03-09T16:29:52.299+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5240985 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
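For context on the numbers in the log above: the `msg=offload` line records the planned GPU/CPU split — here 32 of the model's 41 layers were scheduled for CUDA, so some CPU residency is expected even when the GPU is detected. A minimal parse of that line (the regex is illustrative; the field names are as printed in the log):

```python
import re

# The offload line from the log above, truncated to the relevant fields.
line = ("time=2025-03-09T16:24:52.766+08:00 level=INFO source=server.go:130 "
        "msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=32")

m = re.search(r"layers\.model=(\d+) layers\.offload=(\d+)", line)
total_layers, gpu_layers = map(int, m.groups())
print(f"{gpu_layers}/{total_layers} layers planned for the GPU "
      f"(~{100 * gpu_layers // total_layers}%)")  # 32/41 layers (~78%)
```

Note that this is only the scheduler's *plan*; the later `load_tensors: CPU model buffer size` and `using CPU backend` lines show the runner actually placed everything on the CPU.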

I hope this sheds some light on what is going on with my Ollama installation.
I have already updated the NVIDIA drivers and installed the NVIDIA toolkit.

Ollama used to fully utilize my GPU (an RTX 4070 SUPER with 12 GB of VRAM).

Based on the last lines of the server log, Ollama is loading the mllama model onto the CPU backend instead of the GPU, even though it detects my 12 GB of VRAM.

I'm not sure what happened or which update broke GPU support, but I think this started even before the latest Ollama update, so I suspect an NVIDIA driver update.

Anyway, thank you all for your help.
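For anyone wanting to cross-check the percentages that `ollama ps` prints: they can be derived from the `size` and `size_vram` fields returned by Ollama's `/api/ps` endpoint. A sketch (the sample values below are hypothetical, not taken from this machine):

```python
import json

def cpu_gpu_split(model: dict) -> tuple[int, int]:
    """Return (cpu_pct, gpu_pct) for one /api/ps entry, mirroring `ollama ps`."""
    total = model["size"]       # total bytes the loaded model occupies
    vram = model["size_vram"]   # bytes resident in GPU VRAM
    gpu = round(100 * vram / total)
    return 100 - gpu, gpu

# Hypothetical /api/ps entry for illustration only.
sample = json.loads('{"name": "llama3.2-vision", '
                    '"size": 12000000000, "size_vram": 10200000000}')
print(cpu_gpu_split(sample))  # → (15, 85)
```

If `size_vram` is near zero while the system tray shows no VRAM growth, the model really is running on the CPU regardless of what the split claims.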

<!-- gh-comment-id:2708735482 --> @RadEdje commented on GitHub (Mar 9, 2025): > Server log will show why GPU is not used. oh sorry.... will put server logs here.... initial start of ollama from the terminal as of today. ``` 2025/03/09 16:19:33 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]" time=2025-03-09T16:19:33.955+08:00 level=INFO source=images.go:432 msg="total blobs: 51" time=2025-03-09T16:19:33.956+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0" time=2025-03-09T16:19:33.961+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)" time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs" time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1 time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16 time=2025-03-09T16:19:34.076+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda compute=8.9 driver=12.8 
name="NVIDIA GeForce RTX 4070 SUPER" overhead="91.5 MiB" time=2025-03-09T16:19:34.486+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB" time=2025-03-09T16:19:34.489+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB" [GIN] 2025/03/09 - 16:20:04 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/03/09 - 16:20:04 | 200 | 38.42ms | 127.0.0.1 | GET "/api/tags" ``` after using Ollama run llama3.2-vision ![Image](https://github.com/user-attachments/assets/bbec4e3b-ceca-415d-87ea-1ac2f55fe1da) ![Image](https://github.com/user-attachments/assets/3abd713a-a309-4ae5-97c4-274a6aaf4370) ollama ps notes that GPU is utlizied way more than CPU but this is not reflected in the system tray.. ![Image](https://github.com/user-attachments/assets/cbdb68b8-faf9-4d5a-9189-5d153305abd6) updated server logs: ``` 2025/03/09 16:19:33 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false 
ROCR_VISIBLE_DEVICES:]" time=2025-03-09T16:19:33.955+08:00 level=INFO source=images.go:432 msg="total blobs: 51" time=2025-03-09T16:19:33.956+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0" time=2025-03-09T16:19:33.961+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)" time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs" time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1 time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16 time=2025-03-09T16:19:34.076+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" overhead="91.5 MiB" time=2025-03-09T16:19:34.486+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB" time=2025-03-09T16:19:34.489+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB" [GIN] 2025/03/09 - 16:20:04 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2025/03/09 - 16:20:04 | 200 | 38.42ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/03/09 - 16:24:52 | 200 | 2.035ms | 127.0.0.1 | HEAD "/" [GIN] 2025/03/09 - 16:24:52 | 200 | 40.7808ms | 127.0.0.1 | POST "/api/show" time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128 time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128 time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128 time=2025-03-09T16:24:52.748+08:00 level=WARN 
source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128 time=2025-03-09T16:24:52.765+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="22.9 GiB" free_swap="37.3 GiB" time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128 time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128 time=2025-03-09T16:24:52.766+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=32 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.2 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.2 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB" time=2025-03-09T16:24:52.771+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 32 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 50660" time=2025-03-09T16:24:52.774+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 time=2025-03-09T16:24:52.774+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding" time=2025-03-09T16:24:52.775+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error" time=2025-03-09T16:24:52.792+08:00 level=INFO source=runner.go:931 msg="starting go runner" 
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll time=2025-03-09T16:25:02.405+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8 time=2025-03-09T16:25:02.406+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:50660" llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW)
load: special tokens cache size = 257
time=2025-03-09T16:25:02.538+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  5679.33 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load:
mllama_model_load: mllama_model_load: using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-03-09T16:25:07.544+08:00 level=INFO source=server.go:596 msg="llama runner started in 14.77 seconds"
[GIN] 2025/03/09 - 16:25:07 | 200 |    14.857539s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/09 - 16:25:26 | 200 |    7.7269084s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/09 - 16:26:35 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:26:35 | 200 |       521.9µs |       127.0.0.1 | GET      "/api/ps"
```

unloading the model with `ollama stop` shows a drastic decrease in CPU RAM usage:

![Image](https://github.com/user-attachments/assets/5b2f5dde-c7ff-4c8b-9bed-9288b8c287f9)

here is the updated server log after using `ollama stop`:

```
2025/03/09 16:19:33 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-03-09T16:19:33.955+08:00 level=INFO source=images.go:432 msg="total blobs: 51"
time=2025-03-09T16:19:33.956+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-09T16:19:33.961+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-03-09T16:19:34.076+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" overhead="91.5 MiB"
time=2025-03-09T16:19:34.486+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB"
time=2025-03-09T16:19:34.489+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/03/09 - 16:20:04 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:20:04 | 200 |       38.42ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/03/09 - 16:24:52 | 200 |       2.035ms |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:24:52 | 200 |     40.7808ms |       127.0.0.1 | POST     "/api/show"
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.765+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="22.9 GiB" free_swap="37.3 GiB"
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.766+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=32 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.2 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.2 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB"
time=2025-03-09T16:24:52.771+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 32 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 50660"
time=2025-03-09T16:24:52.774+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T16:24:52.774+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T16:24:52.775+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T16:24:52.792+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-09T16:25:02.405+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T16:25:02.406+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:50660"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW)
load: special tokens cache size = 257
time=2025-03-09T16:25:02.538+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  5679.33 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load:
mllama_model_load: mllama_model_load: using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-03-09T16:25:07.544+08:00 level=INFO source=server.go:596 msg="llama runner started in 14.77 seconds"
[GIN] 2025/03/09 - 16:25:07 | 200 |    14.857539s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/09 - 16:25:26 | 200 |    7.7269084s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/09 - 16:26:35 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:26:35 | 200 |       521.9µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/03/09 - 16:29:46 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:29:46 | 200 |     45.4588ms |       127.0.0.1 | POST     "/api/generate"
time=2025-03-09T16:29:51.799+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0243079 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-03-09T16:29:52.048+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2738438 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-03-09T16:29:52.299+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5240985 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
```

I hope this sheds some light on what is going on with my Ollama installation. I have already updated the NVIDIA drivers and installed the NVIDIA toolkit. Ollama used to fully utilize my GPU (RTX 4070 SUPER with 12 GiB of VRAM). Based on the last lines of the server log, Ollama is loading the mllama model on the CPU backend instead of the GPU, even though it detects my 12 GiB of VRAM. I'm not sure what happened or which update broke GPU support, but I think this started before even the last Ollama update, so I suspect an NVIDIA driver update? Anyway, thank you all for your help.
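As a quick way to confirm which backend the runner actually picked, one can grep the runner output for the buffer and backend lines. This is just a sketch: the excerpt below is abbreviated from the log above, and the scratch-file path is arbitrary.

```shell
# A few lines abbreviated from the log above; a fully offloaded model would
# report CUDA buffers ("CUDA0 model buffer size") and a CUDA backend instead.
cat > /tmp/ollama_excerpt.log <<'EOF'
load_tensors:          CPU model buffer size =  5679.33 MiB
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
mllama_model_load: mllama_model_load: using CPU backend
EOF

# Every match mentioning "CPU" means that part of the model lives in system RAM.
grep -E "model buffer size|KV buffer size|using [A-Z]+ backend" /tmp/ollama_excerpt.log
```

In this log all three lines say CPU, which matches the symptom: system RAM fills up while VRAM stays flat.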

@RadEdje commented on GitHub (Mar 9, 2025):

I ran into the same problem on Linux, and it was resolved after I reinstalled Ollama.

Hello. I tried uninstalling and reinstalling, but the issue is the same. I suspect Ollama leaves something behind that I need to purge, and that it affects the reinstall?

You can try the steps I took. After reading install.sh and following the installation steps, I did the following to uninstall:

  1. sudo systemctl stop ollama
  2. sudo systemctl disable ollama
  3. sudo rm /usr/local/bin/ollama
  4. back up the config: cp /etc/systemd/system/ollama.service ~/backup/ (it will be removed during reinstall, in case it is needed later)
  5. sudo rm -rf /etc/ollama

What I did NOT do:

  1. remove the ollama user/group
  2. remove the config in /etc/systemd/system/ollama.service
  3. remove the models that were in use

In addition, my CUDA environment (including cuDNN, cuBLAS, etc.) was not reinstalled; based on the prompts during reinstallation, install.sh automatically skipped installing it, because I had set up the CUDA environment myself before installing Ollama. If your CUDA environment was installed automatically by Ollama, you may still need to find the dependencies that were installed and reinstall them based on the contents of install.sh, I guess.
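The uninstall steps above can be sketched as a small script. This is only an illustration of the same manual steps (Linux, systemd install via install.sh), not an official uninstaller; it previews the commands by default, and `DRY_RUN=0` actually runs them.

```shell
#!/bin/sh
# Sketch of the manual uninstall steps above. DRY_RUN defaults to 1,
# so the script only prints what it would do.
set -eu
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}
run sudo systemctl stop ollama
run sudo systemctl disable ollama
run sudo rm /usr/local/bin/ollama
run mkdir -p "$HOME/backup"
run cp /etc/systemd/system/ollama.service "$HOME/backup/"  # keep a copy; reinstall recreates it
run sudo rm -rf /etc/ollama
# Deliberately NOT removed: the ollama user/group, the systemd unit itself, and the models.
```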

Not sure if this will help explain what I'm going through, but this is what happened in the server logs when I used `ollama run llama3.2-vision:gpu` (which is supposed to force GPU use, correct?).

Image

server logs:

2025/03/09 16:19:33 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-03-09T16:19:33.955+08:00 level=INFO source=images.go:432 msg="total blobs: 51"
time=2025-03-09T16:19:33.956+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-09T16:19:33.961+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-03-09T16:19:34.076+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" overhead="91.5 MiB"
time=2025-03-09T16:19:34.486+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB"
time=2025-03-09T16:19:34.489+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/03/09 - 16:20:04 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:20:04 | 200 |       38.42ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/03/09 - 16:24:52 | 200 |       2.035ms |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:24:52 | 200 |     40.7808ms |       127.0.0.1 | POST     "/api/show"
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.765+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="22.9 GiB" free_swap="37.3 GiB"
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.766+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=32 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.2 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.2 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB"
time=2025-03-09T16:24:52.771+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 32 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 50660"
time=2025-03-09T16:24:52.774+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T16:24:52.774+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T16:24:52.775+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T16:24:52.792+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-09T16:25:02.405+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T16:25:02.406+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:50660"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW) 
load: special tokens cache size = 257
time=2025-03-09T16:25:02.538+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  5679.33 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: mllama_model_load: using CPU backend

mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-03-09T16:25:07.544+08:00 level=INFO source=server.go:596 msg="llama runner started in 14.77 seconds"
[GIN] 2025/03/09 - 16:25:07 | 200 |    14.857539s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/09 - 16:25:26 | 200 |    7.7269084s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/09 - 16:26:35 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:26:35 | 200 |       521.9µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/03/09 - 16:29:46 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:29:46 | 200 |     45.4588ms |       127.0.0.1 | POST     "/api/generate"
time=2025-03-09T16:29:51.799+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0243079 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-03-09T16:29:52.048+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2738438 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-03-09T16:29:52.299+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5240985 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
[GIN] 2025/03/09 - 16:39:11 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:39:11 | 200 |     213.381ms |       127.0.0.1 | POST     "/api/show"
time=2025-03-09T16:39:12.007+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:39:12.007+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:39:12.008+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:39:12.009+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:39:12.023+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="22.9 GiB" free_swap="37.2 GiB"
time=2025-03-09T16:39:12.025+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:39:12.025+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:39:12.026+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=999 layers.model=41 layers.offload=33 layers.split="" memory.available="[10.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.3 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.3 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB"
time=2025-03-09T16:39:12.031+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 999 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 50813"
time=2025-03-09T16:39:12.036+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T16:39:12.036+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T16:39:12.037+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T16:39:12.075+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-09T16:39:13.574+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T16:39:13.574+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:50813"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW) 
load: special tokens cache size = 257
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  5679.33 MiB
time=2025-03-09T16:39:13.790+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: mllama_model_load: using CPU backend

mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-03-09T16:39:18.546+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.51 seconds"
[GIN] 2025/03/09 - 16:39:18 | 200 |    6.6060601s |       127.0.0.1 | POST     "/api/generate"


It still uses the CPU backend. :-(
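The decisive line in these logs is the scheduler's `msg=offload` entry, which records how many model layers actually went to the GPU. As a quick sketch (the sample line below is abbreviated from the log in this thread; field names follow the log format shown above), the split can be pulled out with standard shell tools:

```shell
# Sketch: extract the scheduler's GPU-offload decision from an Ollama
# server log line (the msg=offload entry seen in the logs above).
line='level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=32'

# layers.model = total layers in the model, layers.offload = layers sent to GPU
model=$(printf '%s\n' "$line" | grep -o 'layers\.model=[0-9]*' | cut -d= -f2)
offload=$(printf '%s\n' "$line" | grep -o 'layers\.offload=[0-9]*' | cut -d= -f2)

echo "$offload of $model layers offloaded to GPU"
# prints: 32 of 41 layers offloaded to GPU
```

If `layers.offload` is 0 (or the runner later reports "using CPU backend" as here), the model is running on the CPU regardless of what the GPU detection lines say.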

<!-- gh-comment-id:2708738989 --> @RadEdje commented on GitHub (Mar 9, 2025):

> > > I ran into the same problem on Linux; it went away after I reinstalled Ollama.
> >
> > Hello. I tried uninstalling and reinstalling, but I have the same issue. I think I need to purge something that Ollama might be leaving behind that affects the reinstall?
>
> You can try the steps I did. After reading `install.sh` and following its installation steps, I did the following to uninstall:
>
> 1. `sudo systemctl stop ollama`
> 2. `sudo systemctl disable ollama`
> 3. `sudo rm /usr/local/bin/ollama`
> 4. Back up the config with `cp /etc/systemd/system/ollama.service ~/backup/`, since it will be removed on reinstall and may still be needed.
> 5. `sudo rm -rf /etc/ollama`
>
> What I did NOT do:
>
> 1. Remove the `ollama` user/group.
> 2. Remove the config in `/etc/systemd/system/ollama.service`.
> 3. Remove the model that was in use.
>
> In addition, my CUDA environment (including cuDNN, cuBLAS, etc.) was not reinstalled; based on the prompts during reinstallation, `install.sh` automatically skipped installing CUDA because I had installed it myself before installing Ollama. If your CUDA environment was installed automatically by Ollama, you may still need to find the dependencies it originally installed and reinstall them based on the contents of `install.sh`.

I guess. Not sure if this will help explain what I'm going through, but this is what happened in the server logs when I ran `ollama run llama3.2-vision:gpu` (which is supposed to force GPU use, correct?).
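The Linux clean-uninstall sequence quoted above can be sketched as a small script. This is illustrative only: the `run` helper, the `DRY_RUN` flag, and the `~/backup` path are my additions, not part of Ollama or its `install.sh`.

```shell
# Hedged sketch of the uninstall steps quoted above (Linux only).
# DRY_RUN=1 only prints each command; set DRY_RUN=0 to actually run them.
DRY_RUN=1

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"   # show what would be executed
  else
    "$@"
  fi
}

run sudo systemctl stop ollama
run sudo systemctl disable ollama
run sudo rm /usr/local/bin/ollama
run mkdir -p "$HOME/backup"                               # keep a copy of the unit file
run cp /etc/systemd/system/ollama.service "$HOME/backup/" # removed on reinstall
run sudo rm -rf /etc/ollama
```

Note that, as the commenter says, this deliberately leaves the `ollama` user/group, the systemd unit, and the downloaded models in place.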
![Image](https://github.com/user-attachments/assets/d0c08b89-a524-4708-8153-4f0b8e76e4de)

server logs:

```
2025/03/09 16:19:33 routes.go:1215: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-03-09T16:19:33.955+08:00 level=INFO source=images.go:432 msg="total blobs: 51"
time=2025-03-09T16:19:33.956+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-09T16:19:33.961+08:00 level=INFO source=routes.go:1277 msg="Listening on 127.0.0.1:11434 (version 0.5.13)"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-09T16:19:33.962+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-03-09T16:19:34.076+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" overhead="91.5 MiB"
time=2025-03-09T16:19:34.486+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB"
time=2025-03-09T16:19:34.489+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/03/09 - 16:20:04 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:20:04 | 200 |       38.42ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/03/09 - 16:24:52 | 200 |       2.035ms |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:24:52 | 200 |     40.7808ms |       127.0.0.1 | POST     "/api/show"
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.746+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.748+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.765+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="22.9 GiB" free_swap="37.3 GiB"
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:24:52.766+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:24:52.766+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=32 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.2 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.2 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB"
time=2025-03-09T16:24:52.771+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 32 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 50660"
time=2025-03-09T16:24:52.774+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T16:24:52.774+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T16:24:52.775+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T16:24:52.792+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-09T16:25:02.405+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T16:25:02.406+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:50660"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW)
load: special tokens cache size = 257
time=2025-03-09T16:25:02.538+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  5679.33 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: mllama_model_load: using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-03-09T16:25:07.544+08:00 level=INFO source=server.go:596 msg="llama runner started in 14.77 seconds"
[GIN] 2025/03/09 - 16:25:07 | 200 |    14.857539s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/09 - 16:25:26 | 200 |    7.7269084s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/09 - 16:26:35 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:26:35 | 200 |       521.9µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/03/09 - 16:29:46 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:29:46 | 200 |     45.4588ms |       127.0.0.1 | POST     "/api/generate"
time=2025-03-09T16:29:51.799+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0243079 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-03-09T16:29:52.048+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2738438 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
time=2025-03-09T16:29:52.299+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5240985 model=D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068
[GIN] 2025/03/09 - 16:39:11 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/09 - 16:39:11 | 200 |     213.381ms |       127.0.0.1 | POST     "/api/show"
time=2025-03-09T16:39:12.007+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:39:12.007+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:39:12.008+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:39:12.009+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:39:12.023+08:00 level=INFO source=server.go:97 msg="system memory" total="31.1 GiB" free="22.9 GiB" free_swap="37.2 GiB"
time=2025-03-09T16:39:12.025+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-09T16:39:12.025+08:00 level=WARN source=ggml.go:136 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-09T16:39:12.026+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=999 layers.model=41 layers.offload=33 layers.split="" memory.available="[10.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.3 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.3 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB"
time=2025-03-09T16:39:12.031+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 999 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 8 --no-mmap --parallel 1 --port 50813"
time=2025-03-09T16:39:12.036+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T16:39:12.036+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T16:39:12.037+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T16:39:12.075+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-09T16:39:13.574+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T16:39:13.574+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:50813"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW)
load: special tokens cache size = 257
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors:          CPU model buffer size =  5679.33 MiB
time=2025-03-09T16:39:13.790+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   656.25 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:        CPU compute buffer size =   258.50 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
mllama_model_load:
model name: Llama-3.2-11B-Vision-Instruct mllama_model_load: description: vision encoder for Mllama mllama_model_load: GGUF version: 3 mllama_model_load: alignment: 32 mllama_model_load: n_tensors: 512 mllama_model_load: n_kv: 17 mllama_model_load: ftype: f16 mllama_model_load: mllama_model_load: mllama_model_load: using CPU backend mllama_model_load: compute allocated memory: 2853.34 MB time=2025-03-09T16:39:18.546+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.51 seconds" [GIN] 2025/03/09 - 16:39:18 | 200 | 6.6060601s | 127.0.0.1 | POST "/api/generate" ``` it still uses CPU back end. :-(
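The telltale entries in a log like this are the `ggml_backend_load_best: failed to load ...` lines followed by a `system info` line that lists only the plain CPU backend. A minimal sketch (the function name is our own; the marker string is taken verbatim from the log above) for pulling the failed libraries out of a saved `server.log`:

```python
def find_backend_failures(log_text: str) -> list[str]:
    """Return the ggml backend libraries that failed to load.

    Scans for the 'ggml_backend_load_best: failed to load' lines that
    appear in the Ollama server log when a CPU/GPU backend cannot be
    linked, and extracts the library path from each.
    """
    marker = "ggml_backend_load_best: failed to load "
    failures = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(marker):
            failures.append(line[len(marker):])
    return failures


# Example against a line copied from the log above:
sample = (
    "ggml_backend_load_best: failed to load "
    "C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\ggml-cpu-alderlake.dll\n"
    'time=2025-03-09T16:39:13.574+08:00 level=INFO source=runner.go:934 '
    'msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8\n'
)
print(find_backend_failures(sample))
```

If the list is non-empty and no `load_backend: loaded CUDA backend ...` line follows, the runner has fallen back to the unoptimized CPU path.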
@NGC13009 commented on GitHub (Mar 9, 2025):

What `ollama ps` and Task Manager show is not accurate. First, from `ollama ps` it looks like there isn't enough VRAM to load the whole model, so 13% of the layers were loaded into system RAM. Although your model is only 7.9 GB, inference can need more memory than your card's VRAM, because the context needs a larger KV cache and the feature maps in the different layers take memory too.

However, if you forced GPU use in the Modelfile and built a new model from it, then inference is definitely running entirely on the GPU; it's just that VRAM is insufficient, so shared GPU memory (i.e. part of system RAM) is being used. Also, the dedicated and shared GPU memory figures in Task Manager are not correct; don't trust Task Manager's statistics.

Have the model generate some output first, then watch the CUDA utilization and CPU utilization in Task Manager.

Generally, if the CPU is doing the computation, CPU usage should be above 90% while CUDA utilization stays low (<40%). If it really is 100% GPU computation, CUDA utilization should exceed 90% and CPU usage won't go above 20% (the CPU is still needed because shared GPU memory has to be managed by the CPU; it isn't actually participating in the model's inference math).

@RadEdje commented on GitHub (Mar 9, 2025):

> What `ollama ps` and Task Manager show is not accurate. First, from `ollama ps` it looks like there isn't enough VRAM to load the whole model, so 13% of the layers were loaded into system RAM. Although your model is only 7.9 GB, inference can need more memory than your card's VRAM, because the context needs a larger KV cache and the feature maps in the different layers take memory too.
>
> However, if you forced GPU use in the Modelfile and built a new model from it, then inference is definitely running entirely on the GPU; it's just that VRAM is insufficient, so shared GPU memory (i.e. part of system RAM) is being used. Also, the dedicated and shared GPU memory figures in Task Manager are not correct; don't trust Task Manager's statistics.
>
> Have the model generate some output first, then watch the CUDA utilization and CPU utilization in Task Manager.
>
> Generally, if the CPU is doing the computation, CPU usage should be above 90% while CUDA utilization stays low (<40%). If it really is 100% GPU computation, CUDA utilization should exceed 90% and CPU usage won't go above 20% (the CPU is still needed because shared GPU memory has to be managed by the CPU; it isn't actually participating in the model's inference math).

Thank you for your reply. I tried what you said. I'm using the llama3.2-vision:gpu model version where I set the GPU to be used by force. The thing is... this used to work before. I would see my GPU VRAM rise when running Ollama. Now it doesn't. However, if I use LM Studio, my VRAM rises and my CPU RAM stays at a typical level even when using other big models like granite vision.

I'm just wondering: if my GPU could run llama3.2-vision before, on top of deepseek-r1, what changed such that Ollama prefers to put it all into CPU RAM now?

Thank you again for your advice

@rick-github commented on GitHub (Mar 9, 2025):

ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-09T16:25:02.405+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8

This is the problem: ollama is failing to load the backends required for GPU/advanced CPU use. There have been a couple of other reports of this (#9266, #9245) but no root cause has been determined yet.

@RadEdje commented on GitHub (Mar 9, 2025):

ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
ggml_backend_load_best: failed to load C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-09T16:25:02.405+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | cgo(clang)" threads=8

This is the problem: ollama is failing to load the backends required for GPU/advanced CPU use. There have been a couple of other reports of this (#9266, #9245) but no root cause has been determined yet.

Ohhhh so it isn't just me.... Glad to know..... I hope it gets fixed... So for now just watch and wait?

Really sad, granite vision and Phi4-mini models just became usable on the latest and greatest ollama.... Guess I'll have to wait and see. Thank you....

@rick-github commented on GitHub (Mar 9, 2025):

Well, it is sort of "just you" - there have been only a few reports so it seems most users don't have a problem. Something specific to the environment of those few users is causing the issue but it hasn't been identified yet.

@RadEdje commented on GitHub (Mar 9, 2025):

Well, it is sort of "just you" - there have been only a few reports so it seems most users don't have a problem. Something specific to the environment of those few users is causing the issue but it hasn't been identified yet.

Oh.... I guess I have to figure this out.... 😔 I wonder... Would rolling back Nvidia drivers work?

@rick-github commented on GitHub (Mar 9, 2025):

I don't think rolling back drivers will help - the failures to load the CPU backends in the logs have nothing to do with external drivers. It seems like there's an inability to load a file or link a DLL into the running process.

@RadEdje commented on GitHub (Mar 9, 2025):

I don't think rolling back drivers will help - the failures to load the CPU backends in the logs have nothing to do with external drivers. It seems like there's an inability to load a file or link a DLL into the running process.

Oh oki. Thanks for the heads up. You saved me a lot of painful rollbacks.... I just wish I knew what changed. Would totally removing everything work: uninstall, remove environment variables, remove all downloaded models, remove the .ollama folders that remain even after uninstall....? I wonder if uninstalling leaves something behind that should be purged on Windows.....

My ollama models folder is on drive D.... since I didn't want to fill up my drive C with models.....

@rick-github commented on GitHub (Mar 9, 2025):

Un-installing, clearing variables, removing everything in C:\Users\PC\AppData\Local\Programs\Ollama and re-installing is worth trying. Removing the models shouldn't be necessary as there are no code related objects there.

@pluberd commented on GitHub (Mar 10, 2025):

I don't know if it helps you. Today I started my Ollama (Linux), which was installed about six weeks ago, and it told me it was using the GPU. But in fact it was using the CPU. I reinstalled and now it works again. Strange.

@RadEdje commented on GitHub (Mar 12, 2025):

Un-installing, clearing variables, removing everything in C:\Users\PC\AppData\Local\Programs\Ollama and re-installing is worth trying. Removing the models shouldn't be necessary as there are no code related objects there.

Hello, to update....
I tried uninstalling everything and removing all environment variables.
I even updated to version 0.6.... nothing worked....

then I saw this post somewhere.... from
https://github.com/ollama/ollama/issues/9266

by https://github.com/Hsq12138

i added this to the PATH:

C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama

THIS FIXED IT...
llama3.2-vision now fully loads onto the GPU....

here is the server log:


2025/03/12 23:32:30 routes.go:1225: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\ollama_models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-03-12T23:32:30.768+08:00 level=INFO source=images.go:432 msg="total blobs: 51"
time=2025-03-12T23:32:30.769+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-12T23:32:30.770+08:00 level=INFO source=routes.go:1292 msg="Listening on 127.0.0.1:11434 (version 0.6.0)"
time=2025-03-12T23:32:30.770+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-12T23:32:30.770+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-12T23:32:30.770+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-03-12T23:32:31.260+08:00 level=INFO source=amd_windows.go:127 msg="unsupported Radeon iGPU detected skipping" id=0 total="12.0 GiB"
time=2025-03-12T23:32:31.261+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-eca11db2-5a89-550c-36c9-adedf39c9da1 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4070 SUPER" total="12.0 GiB" available="10.8 GiB"
[GIN] 2025/03/12 - 23:32:50 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/12 - 23:32:50 | 200 |      3.1285ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/03/12 - 23:33:27 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/12 - 23:33:27 | 200 |     13.8624ms |       127.0.0.1 | POST     "/api/show"
time=2025-03-12T23:33:27.064+08:00 level=WARN source=sched.go:138 msg="mllama doesn't support parallel requests yet"
time=2025-03-12T23:33:27.105+08:00 level=WARN source=ggml.go:149 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-12T23:33:27.105+08:00 level=WARN source=ggml.go:149 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-12T23:33:27.106+08:00 level=WARN source=ggml.go:149 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-12T23:33:27.106+08:00 level=WARN source=ggml.go:149 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-12T23:33:27.129+08:00 level=INFO source=server.go:105 msg="system memory" total="31.1 GiB" free="23.2 GiB" free_swap="37.3 GiB"
time=2025-03-12T23:33:27.130+08:00 level=WARN source=ggml.go:149 msg="key not found" key=mllama.attention.key_length default=128
time=2025-03-12T23:33:27.130+08:00 level=WARN source=ggml.go:149 msg="key not found" key=mllama.attention.value_length default=128
time=2025-03-12T23:33:27.130+08:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=31 layers.split="" memory.available="[10.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.8 GiB" memory.required.partial="10.1 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[10.1 GiB]" memory.weights.total="5.5 GiB" memory.weights.repeating="5.1 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB" projector.weights="1.8 GiB" projector.graph="2.8 GiB"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW) 
load: special tokens cache size = 257
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-03-12T23:33:27.276+08:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\PC\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model D:\\ollama_models\\blobs\\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 --ctx-size 2048 --batch-size 512 --n-gpu-layers 31 --threads 8 --no-mmap --parallel 1 --mmproj D:\\ollama_models\\blobs\\sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --port 50771"
time=2025-03-12T23:33:27.278+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-12T23:33:27.278+08:00 level=INFO source=server.go:585 msg="waiting for llama runner to start responding"
time=2025-03-12T23:33:27.278+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-12T23:33:27.294+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-03-12T23:33:27.456+08:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-03-12T23:33:27.457+08:00 level=INFO source=runner.go:991 msg="Server listening on 127.0.0.1:50771"
time=2025-03-12T23:33:27.529+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4070 SUPER) - 11053 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from D:\ollama_models\blobs\sha256-11f274007f093fefeec994a5dbbb33d0733a4feb87f7ab66dcd7c1069fef0068 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mllama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Model
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                         mllama.block_count u32              = 40
llama_model_loader: - kv   5:                      mllama.context_length u32              = 131072
llama_model_loader: - kv   6:                    mllama.embedding_length u32              = 4096
llama_model_loader: - kv   7:                 mllama.feed_forward_length u32              = 14336
llama_model_loader: - kv   8:                mllama.attention.head_count u32              = 32
llama_model_loader: - kv   9:             mllama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                      mllama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:    mllama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                          general.file_type u32              = 15
llama_model_loader: - kv  13:                          mllama.vocab_size u32              = 128256
llama_model_loader: - kv  14:                mllama.rope.dimension_count u32              = 128
llama_model_loader: - kv  15:    mllama.attention.cross_attention_layers arr[i32,8]       = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv  16:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,128257]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,128257]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 128004
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  114 tensors
llama_model_loader: - type q4_K:  245 tensors
llama_model_loader: - type q6_K:   37 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.55 GiB (4.87 BPW) 
load: special tokens cache size = 257
load: token to piece cache size = 0.7999 MB
print_info: arch             = mllama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 11B
print_info: model params     = 9.78 B
print_info: general.name     = Model
print_info: vocab type       = BPE
print_info: n_vocab          = 128257
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128004 '<|finetune_right_pad_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 31/41 layers to GPU
load_tensors:    CUDA_Host model buffer size =  1556.06 MiB
load_tensors:        CUDA0 model buffer size =  3841.45 MiB
load_tensors:          CPU model buffer size =   281.83 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   500.19 MiB
llama_kv_cache_init:        CPU KV buffer size =   156.06 MiB
llama_init_from_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_init_from_model:        CPU  output buffer size =     0.50 MiB
llama_init_from_model:      CUDA0 compute buffer size =   669.48 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 82 (with bs=512), 3 (with bs=1)
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: mllama_model_load: using CUDA0 backend

mllama_model_load: compute allocated memory: 2853.34 MB
time=2025-03-12T23:33:29.532+08:00 level=INFO source=server.go:624 msg="llama runner started in 2.25 seconds"
[GIN] 2025/03/12 - 23:33:29 | 200 |    2.4808231s |       127.0.0.1 | POST     "/api/generate"
time=2025-03-12T23:33:46.059+08:00 level=WARN source=sched.go:138 msg="mllama doesn't support parallel requests yet"
[GIN] 2025/03/12 - 23:33:47 | 200 |    1.2413645s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/12 - 23:34:23 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/12 - 23:34:23 | 200 |      2.0696ms |       127.0.0.1 | POST     "/api/generate"


I hope this helps clear up anything that might be needed to fix this bug.
Thanks again everyone.

(time to try out Gemma3 !!!! multimodal! yahooooo.....)


@Jay021 commented on GitHub (Mar 16, 2025):

When I run `ollama ps`, Ollama reports 100% GPU usage, but while a model is actually running the CPU is pegged, nvidia-smi shows 0% GPU utilization, and no VRAM is in use. I tried uninstalling and reinstalling Ollama, but the issue persisted, until I came across this post:

#9266
by https://github.com/Hsq12138

I added the following path to the PATH environment variable:
C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama

This perfectly resolved the issue. You can give it a try.
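As a quick sanity check before launching the server, a short script can verify whether that library directory is actually on the current `PATH`. This is only an illustrative sketch, not part of Ollama; the `dir_on_path` helper and the example path (taken from this thread) are hypothetical:

```python
import os

def dir_on_path(directory: str) -> bool:
    """Return True if `directory` appears as an entry in the PATH variable."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    norm = os.path.normcase(os.path.normpath(directory))
    return any(os.path.normcase(os.path.normpath(e)) == norm for e in entries if e)

# Example install location reported in this thread; adjust to your machine.
lib_dir = r"C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama"
print(dir_on_path(lib_dir))  # prints True only if the directory is on PATH
```

If this prints `False` in the same shell that launches Ollama, the runner may fail to load `ggml-cuda.dll` and silently fall back to the CPU backend.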


@RadEdje commented on GitHub (Mar 17, 2025):

> When running the `ollama ps` command, my Ollama shows 100% GPU usage, but in reality the CPU is running at full speed while nvidia-smi shows 0% GPU usage and no VRAM is being utilized. I tried uninstalling and reinstalling Ollama, but the issue persisted, until I came across this post:
>
> #9266 by https://github.com/Hsq12138
>
> I added the following path to the PATH environment variable: `C:\Users\PC\AppData\Local\Programs\Ollama\lib\ollama`
>
> This perfectly resolved the issue. You can give it a try.

Yes, thanks. I tried this too and it fixed everything. I'll close this thread now. Thanks again, everyone.


@mentalblood0 commented on GitHub (Nov 21, 2025):

I had a problem that looked exactly the same on Arch Linux and resolved it by pointing the systemd service at `/usr/bin/ollama` instead of `/usr/local/bin/ollama`.
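For reference, a minimal systemd drop-in illustrating that change might look like the following (a sketch, assuming the distro-packaged binary lives at `/usr/bin/ollama` and the service is named `ollama.service`):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
# Clear the inherited ExecStart, then point at the packaged binary,
# which ships alongside its GPU runtime libraries.
ExecStart=
ExecStart=/usr/bin/ollama serve
```

After editing, `systemctl daemon-reload` followed by `systemctl restart ollama` applies the change.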

Reference: github-starred/ollama#68261