[GH-ISSUE #8961] After ollama upgrade, severe performance drop with deepseek, seems GPU not available #5814

Closed
opened 2026-04-12 17:09:18 -05:00 by GiteaMirror · 42 comments
Owner

Originally created by @liyuheng55555 on GitHub (Feb 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8961

What is the issue?

After upgrading from Ollama 0.3.14 (pre-installed in the server provider’s system image) to 0.5.7 using the official curl installation command, DeepSeek’s loading and execution speed dropped significantly.

Observations:
• Performance Issue: Deepseek was running much slower after the upgrade.
• CPU & Memory Usage: top showed high CPU and memory usage during inference.
• GPU Usage: nvidia-smi displayed no GPU memory usage, but ollama ps still reported “100% GPU.”
• Inference Concern: It seems that Deepseek might be falling back to CPU inference despite showing GPU utilization in ollama ps.

Additional Information:
• The issue persisted until I reverted the system by reinstalling the original system image.
• Attached are screenshots from top, nvidia-smi, and ollama ps during Deepseek’s execution.

Screenshots:
https://github.com/user-attachments/assets/1ff1a559-bc2a-406c-b603-536a270d9b80
https://github.com/user-attachments/assets/58cfa0ed-6778-4e2d-8090-5a61575c27ef
https://github.com/user-attachments/assets/74df5a01-bc86-4f90-ad37-92c2e4ee1f78
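A quick way to cross-check whether inference is really hitting the GPU is to watch VRAM while a prompt is running. This is a minimal sketch assuming a single NVIDIA GPU; the `deepseek-r1:32b` tag is only an assumption, substitute whichever DeepSeek model is installed:

```shell
# In one terminal: trigger a generation so there is load to observe
# (model tag is an assumption; use whichever DeepSeek tag is installed)
ollama run deepseek-r1:32b "hello"

# In a second terminal: watch VRAM and GPU utilization while it answers.
# If memory.used stays flat here while `top` shows heavy CPU use,
# the runner has fallen back to CPU despite what `ollama ps` reports.
watch -n 1 nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv

# Ollama's own view of the CPU/GPU split for the loaded model
ollama ps
```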

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the bug, needs more info labels 2026-04-12 17:09:18 -05:00
Author
Owner

@YonTracks commented on GitHub (Feb 9, 2025):

It seems a fresh 0.5.7 install needs the new GPU build instructions:
https://github.com/ollama/ollama/releases/tag/v0.5.8-rc12

https://github.com/ollama/ollama/blob/main/docs/development.md
good luck.
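For reference, the CMake-based build described in the linked development docs looks roughly like this; treat it as a sketch, since the exact steps and prerequisites can change between releases:

```shell
# Clone and build the native GPU backends
# (requires CMake and the CUDA toolkit on PATH)
git clone https://github.com/ollama/ollama.git
cd ollama
cmake -B build
cmake --build build

# Run the freshly built server from the source tree
go run . serve
```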

Author
Owner

@rick-github commented on GitHub (Feb 9, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
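On a Linux install from the official script, the service logs can usually be collected with journalctl; the debug override below is optional and only an assumption that more detail on GPU discovery is wanted:

```shell
# Capture the Ollama service log while reproducing the slow run
journalctl -u ollama --no-pager > ollama_serve.txt

# Optional: enable debug logging for more detail on GPU discovery
# (add `Environment="OLLAMA_DEBUG=1"` under [Service] in the override file)
sudo systemctl edit ollama
sudo systemctl restart ollama
```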

Author
Owner

@billye commented on GitHub (Feb 10, 2025):

I have the same problem. Has it been solved?

Author
Owner

@billye commented on GitHub (Feb 10, 2025):

@rick-github Server logs are here:

ollama_serve.txt (https://github.com/user-attachments/files/18730416/ollama_serve.txt)

Author
Owner

@liyuheng55555 commented on GitHub (Feb 10, 2025):

@YonTracks Thanks for the solution. I think I’ll stick with the old version of ollama for now.

@rick-github Unfortunately, the logs were not preserved.

@billye Looks like your video memory is being used normally.
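If staying on the old build, the official install script can pin a specific release via the OLLAMA_VERSION variable; this is a sketch based on the documented behaviour, assuming 0.3.14 is the version to keep:

```shell
# Reinstall a specific Ollama release instead of the latest one
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.3.14 sh

# Confirm the pinned version is what actually runs
ollama --version
```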

Author
Owner

@YonTracks commented on GitHub (Feb 10, 2025):

cheers.
The actual problem I was having: the GPU files installed under "...\AppData\Local\Programs\Ollama\lib\ollama\" were missing the cuda_v11, cuda_v12, and/or rocm directories. Ollama shows no errors, everything in the logs looks good, and ollama ps shows 100% GPU, but it is slow (even slower than CPU only) because no GPU is actually being used!

If the CMake build config does not find the GPU, or is not set up for it, then no GPU files are created (environment settings and CUDA toolkit problems will also prevent this).

After sorting out the GPU/CUDA environment issues (the reason I had problems is that I installed CUDA toolkit 12.8 and CMake fresh from the dev instructions...), and with the correct Path, CMake now builds successfully and the files land in build/lib/ollama (for development).
When running from the dev folder, it works great, whether via ./ollama serve or go run . serve.

Now here's the issue:
I used to build via the ollama.iss script with powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1, and this would create an OllamaSetup.exe that installed the GPU files correctly.
Now when running powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1, it compiles very quickly, but does not install the GPU files, and it removes any current files, breaking the installed Ollama.
Manually copying the GPU files from build/lib/ollama to "...\AppData\Local\Programs\Ollama\lib\ollama\" makes it work again.
Or, with a modified .iss:

#if DirExists("..\build\lib\ollama")
Source: "..\build\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs
#endif
Author
Owner

@rick-github commented on GitHub (Feb 10, 2025):

@billye

llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloaded 18/65 layers to GPU
llm_load_tensors:    CUDA_Host model buffer size = 13365.91 MiB
llm_load_tensors:        CUDA0 model buffer size =  5142.45 MiB
llm_load_tensors:          CPU model buffer size =   417.66 MiB

Your GPU is being used.
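For anyone else reading their own server log, the offload lines are the quickest thing to check; a hedged example of pulling them out (the file name is just an assumption, use whatever log file or journalctl output you have):

```shell
# Show how many layers landed on the GPU and which GPU libraries were picked up;
# "offloaded 0/NN layers" or compatible=[] means the run is effectively CPU-only
grep -E "offload|compatible gpu libraries|model buffer size" ollama_serve.txt
```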

Author
Owner

@YonTracks commented on GitHub (Feb 10, 2025):

Here's a non-working server.log:

2025/02/10 21:03:03 routes.go:1186: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\clint\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:true ROCR_VISIBLE_DEVICES:]"
time=2025-02-10T21:03:03.921+10:00 level=INFO source=images.go:432 msg="total blobs: 132"
time=2025-02-10T21:03:03.930+10:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-02-10T21:03:03.935+10:00 level=INFO source=routes.go:1237 msg="Listening on 127.0.0.1:11434 (version 0.5.7)"
time=2025-02-10T21:03:03.935+10:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2025-02-10T21:03:03.936+10:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-02-10T21:03:03.936+10:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-02-10T21:03:03.936+10:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-02-10T21:03:03.936+10:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-02-10T21:03:03.936+10:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=nvml.dll
time=2025-02-10T21:03:03.936+10:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="
..."
time=2025-02-10T21:03:03.957+10:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[C:\\Windows\\System32\\nvcuda.dll C:\\WINDOWS\\system32\\nvcuda.dll]"
time=2025-02-10T21:03:03.969+10:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=C:\Windows\System32\nvcuda.dll
time=2025-02-10T21:03:04.081+10:00 level=DEBUG source=amd_windows.go:34 msg="unable to load amdhip64_6.dll, please make sure to upgrade to the latest amd driver: The specified module could not be found."
time=2025-02-10T21:03:04.082+10:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
time=2025-02-10T21:03:37.182+10:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="31.9 GiB" before.free="22.3 GiB" before.free_swap="20.0 GiB" now.total="31.9 GiB" now.free="22.4 GiB" now.free_swap="20.0 GiB"
time=2025-02-10T21:03:37.190+10:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="10.8 GiB" now.used="1.2 GiB"
time=2025-02-10T21:03:37.191+10:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7ff7a91e9640 gpu_count=1
time=2025-02-10T21:03:37.219+10:00 level=DEBUG source=sched.go:224 msg="loading first model" model=C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93
time=2025-02-10T21:03:37.219+10:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2025-02-10T21:03:37.220+10:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="31.9 GiB" before.free="22.4 GiB" before.free_swap="20.0 GiB" now.total="31.9 GiB" now.free="22.4 GiB" now.free_swap="20.0 GiB"
time=2025-02-10T21:03:37.237+10:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="12.0 GiB" before.free="10.8 GiB" now.total="12.0 GiB" now.free="10.8 GiB" now.used="1.2 GiB"
time=2025-02-10T21:03:37.239+10:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2025-02-10T21:03:37.239+10:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="31.9 GiB" before.free="22.4 GiB" before.free_swap="20.0 GiB" now.total="31.9 GiB" now.free="22.4 GiB" now.free_swap="20.0 GiB"
time=2025-02-10T21:03:37.252+10:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="12.0 GiB" before.free="10.8 GiB" now.total="12.0 GiB" now.free="10.8 GiB" now.used="1.2 GiB"
time=2025-02-10T21:03:37.254+10:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="31.9 GiB" before.free="22.4 GiB" before.free_swap="20.0 GiB" now.total="31.9 GiB" now.free="22.4 GiB" now.free_swap="20.0 GiB"
time=2025-02-10T21:03:37.269+10:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="12.0 GiB" before.free="10.8 GiB" now.total="12.0 GiB" now.free="10.8 GiB" now.used="1.2 GiB"
time=2025-02-10T21:03:37.269+10:00 level=INFO source=server.go:100 msg="system memory" total="31.9 GiB" free="22.4 GiB" free_swap="20.0 GiB"
time=2025-02-10T21:03:37.269+10:00 level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2025-02-10T21:03:37.269+10:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="31.9 GiB" before.free="22.4 GiB" before.free_swap="20.0 GiB" now.total="31.9 GiB" now.free="22.4 GiB" now.free_swap="20.0 GiB"
time=2025-02-10T21:03:37.284+10:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="12.0 GiB" before.free="10.8 GiB" now.total="12.0 GiB" now.free="10.8 GiB" now.used="1.2 GiB"
time=2025-02-10T21:03:37.285+10:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=30 layers.split="" memory.available="[10.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.3 GiB" memory.required.partial="10.8 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[10.8 GiB]" memory.weights.total="19.5 GiB" memory.weights.repeating="18.9 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="668.3 MiB" memory.graph.partial="916.1 MiB"
time=2025-02-10T21:03:37.285+10:00 level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-02-10T21:03:37.285+10:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""
time=2025-02-10T21:03:37.285+10:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[]

<when working>
time=2025-02-10T21:38:51.710+10:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible="[cuda_v12 cuda_v11]"
time=2025-02-10T21:38:51.718+10:00 level=DEBUG source=server.go:302 msg="adding gpu library" path=C:\Users\clint\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12
time=2025-02-10T21:38:51.718+10:00 level=DEBUG source=server.go:310 msg="adding gpu dependency paths" paths=[C:\Users\clint\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12]
</when working>

time=2025-02-10T21:03:37.293+10:00 level=INFO source=server.go:381 msg="starting llama server" cmd="
..."

time=2025-02-10T21:03:37.298+10:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-10T21:03:37.298+10:00 level=INFO source=server.go:558 msg="waiting for llama runner to start responding"
time=2025-02-10T21:03:37.299+10:00 level=INFO source=server.go:592 msg="waiting for server to become available" status="llm server error"
time=2025-02-10T21:03:37.326+10:00 level=INFO source=runner.go:936 msg="starting go runner"
time=2025-02-10T21:03:37.327+10:00 level=INFO source=runner.go:937 msg=system info="CPU : SSE3 = 1 | LLAMAFILE = 1 | CPU : SSE3 = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=6
time=2025-02-10T21:03:37.327+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path=C:\Users\clint\AppData\Local\Programs\Ollama
time=2025-02-10T21:03:37.327+10:00 level=INFO source=runner.go:995 msg="Server listening on 127.0.0.1:56682"
time=2025-02-10T21:03:37.331+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\bin"
time=2025-02-10T21:03:37.349+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.8\\libnvvp"
time=2025-02-10T21:03:37.354+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path=C:\Windows\System32
time=2025-02-10T21:03:37.550+10:00 level=INFO source=server.go:592 msg="waiting for server to become available" status="llm server loading model"
time=2025-02-10T21:03:41.697+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path="C:\\Program Files\\Go\\bin"
time=2025-02-10T21:03:41.699+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path="C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.42.34433\\bin\\Hostx64\\x64"
time=2025-02-10T21:03:41.728+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path="C:\\Program Files\\dotnet"
time=2025-02-10T21:03:41.733+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path="C:\\Program Files\\nodejs"
time=2025-02-10T21:03:41.739+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path="C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\bin"
time=2025-02-10T21:03:41.755+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path="C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2025.1.0"
time=2025-02-10T21:03:41.797+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path=C:\Users\clint\AppData\Local\Programs\Ollama\lib\ollama
time=2025-02-10T21:03:41.806+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path=C:\Users\clint\AppData\Local\Programs\Ollama\lib\cuda
time=2025-02-10T21:03:41.807+10:00 level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path=C:\Users\clint\AppData\Local\Programs\Ollama\lib\rocm
llama_model_loader: loaded meta data with 26 key-value pairs and 771 tensors from C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
llm_load_vocab: control token: 151645 '<|Assistant|>' is not marked as EOG
llm_load_vocab: control token: 151644 '<|User|>' is not marked as EOG
llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
llm_load_vocab: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
llm_load_vocab: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
llm_load_vocab: control token: 151647 '<|EOT|>' is not marked as EOG
llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 27648
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 32B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 32.76 B
llm_load_print_meta: model size       = 18.48 GiB (4.85 BPW) 
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 32B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors:          CPU model buffer size = 18926.01 MiB
load_all_data: no device found for buffer type CPU for async uploads

<when working>
llm_load_tensors:          CPU model buffer size =   417.66 MiB
load_all_data: buffer type CUDA_Host is not the default buffer type for device CUDA0 for async uploads
time=2025-02-10T21:39:07.743+10:00 level=DEBUG source=server.go:603 msg="model load progress 0.00"
</when working>

time=2025-02-10T21:03:42.304+10:00 level=DEBUG source=server.go:603 msg="model load progress 0.02"
time=2025-02-10T21:03:42.555+10:00 level=DEBUG source=server.go:603 msg="model load progress 0.06"
time=2025-02-10T21:03:58.573+10:00 level=DEBUG source=server.go:603 msg="model load progress 0.97"
time=2025-02-10T21:03:58.823+10:00 level=DEBUG source=server.go:603 msg="model load progress 0.99"
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 8192
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 1
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 34: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 35: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 36: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 37: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 38: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 39: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 40: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 41: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 42: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 43: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 44: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 45: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 46: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 47: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 48: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 49: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 50: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 51: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 52: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 53: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 54: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 55: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 56: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 57: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 58: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 59: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 60: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 61: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 62: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 63: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
time=2025-02-10T21:03:59.073+10:00 level=DEBUG source=server.go:603 msg="model load progress 1.00"

<when working>
llama_kv_cache_init:      CUDA0 KV buffer size =   480.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   480.00 MiB
time=2025-02-10T21:39:20.506+10:00 level=DEBUG source=server.go:606 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init:        CPU KV buffer size =  1088.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.60 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   916.08 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   146.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    26.01 MiB
llama_new_context_with_model: graph nodes  = 1991
llama_new_context_with_model: graph splits = 481 (with bs=512), 4 (with bs=1)
</when working>

time=2025-02-10T21:03:59.323+10:00 level=DEBUG source=server.go:606 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init:        CPU KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.60 MiB
llama_new_context_with_model:        CPU compute buffer size =   307.00 MiB
llama_new_context_with_model: graph nodes  = 1991
llama_new_context_with_model: graph splits = 1
time=2025-02-10T21:03:59.825+10:00 level=INFO source=server.go:597 msg="llama runner started in 22.53 seconds"
time=2025-02-10T21:03:59.826+10:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93
time=2025-02-10T21:03:59.827+10:00 level=DEBUG source=routes.go:1461 msg="chat request" images=0 prompt=<|User|>hello<|Assistant|>
time=2025-02-10T21:03:59.830+10:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=4 used=0 remaining=4
[GIN] 2025/02/10 - 21:05:05 | 200 |         1m28s |       127.0.0.1 | POST     "/api/chat"
time=2025-02-10T21:05:05.680+10:00 level=DEBUG source=sched.go:466 msg="context for request finished"
time=2025-02-10T21:05:05.680+10:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 duration=30m0s
time=2025-02-10T21:05:05.680+10:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 refCount=0
time=2025-02-10T21:05:05.781+10:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93
time=2025-02-10T21:05:05.781+10:00 level=DEBUG source=routes.go:1461 msg="chat request" images=0 prompt="<|User|>Given the following... please reply with a title for the chat that is 3-4 words in length, all words used should be directly related to the content of the chat, avoid using verbs unless they are directly related to the content of the chat, no additional text or explanation, you don't need ending punctuation.\n\nHello! How can I assist you today? 😊<|Assistant|>"
time=2025-02-10T21:05:05.782+10:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=19 prompt=81 used=2 remaining=79
[GIN] 2025/02/10 - 21:10:13 | 200 |          5m8s |       127.0.0.1 | POST     "/api/chat"
time=2025-02-10T21:10:13.972+10:00 level=DEBUG source=sched.go:407 msg="context for request finished"
time=2025-02-10T21:10:13.972+10:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 duration=30m0s
time=2025-02-10T21:10:13.972+10:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 refCount=0
😊<|Assistant|>" time=2025-02-10T21:05:05.782+10:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=19 prompt=81 used=2 remaining=79 [GIN] 2025/02/10 - 21:10:13 | 200 | 5m8s | 127.0.0.1 | POST "/api/chat" time=2025-02-10T21:10:13.972+10:00 level=DEBUG source=sched.go:407 msg="context for request finished" time=2025-02-10T21:10:13.972+10:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 duration=30m0s time=2025-02-10T21:10:13.972+10:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 refCount=0 ```
Author
Owner

@YonTracks commented on GitHub (Feb 10, 2025):

far out, I still have not learned how to post logs properly, sorry about that.
I'll attach a txt file from now on.
edit^:

$ ollama ps
NAME               ID              SIZE     PROCESSOR          UNTIL
deepseek-r1:32b    38056bbcbb2d    23 GB    52%/48% CPU/GPU    26 minutes from now 

same with llama3.1 the ollama ps:

$ ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
llama3.1:latest    46e0c10c039e    6.9 GB    100% GPU     29 minutes from now

Slow as: the GPU is not actually being used (I think it only adds overhead); the speed is slower than with CPU only, and the reported model size is larger.

<!-- gh-comment-id:2647672643 --> @YonTracks commented on GitHub (Feb 10, 2025): far out, I still have not learned how to post logs, sorry about that. txt file, from now on. edit^: ``` $ ollama ps NAME ID SIZE PROCESSOR UNTIL deepseek-r1:32b 38056bbcbb2d 23 GB 52%/48% CPU/GPU 26 minutes from now ``` same with llama3.1 the ollama ps: ``` $ ollama ps NAME ID SIZE PROCESSOR UNTIL llama3.1:latest 46e0c10c039e 6.9 GB 100% GPU 29 minutes from now ``` slow as, gpu is not being used (I think it only adds the overhead), and the speed is slower than with cpu only and the model size is larger.
Author
Owner

@rick-github commented on GitHub (Feb 10, 2025):

The GPU usage shown by ollama ps is calculated before the runner is started: ollama expects a GPU-enabled runner based on the GPUs detected, but when no GPU-enabled runner is found it falls back to the CPU, and the reported GPU usage is not updated.

This should be exactly the same as running CPU only. If the speed is different, that indicates something else is broken. You would need to provide some logs to allow debugging. A simple check is to run the ollama client in GPU/non-GPU modes:

Baseline (want to start GPU but GPU runner may not be available):

$ ollama run --verbose llama3.1
>>> why is the sky blue?
...
total duration:       5.10967256s
load duration:        190.90907ms
prompt eval count:    16 token(s)
prompt eval duration: 117ms
prompt eval rate:     136.75 tokens/s
eval count:           386 token(s)
eval duration:        4.8s
eval rate:            80.42 tokens/s

Switch to CPU:

>>> /set parameter num_gpu 0
Set parameter 'num_gpu' to '0'
>>> why is the sky blue?
...
total duration:       26.360083051s
load duration:        3.504271067s
prompt eval count:    417 token(s)
prompt eval duration: 12.149s
prompt eval rate:     34.32 tokens/s
eval count:           115 token(s)
eval duration:        10.243s
eval rate:            11.23 tokens/s
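
The same comparison can also be scripted against the HTTP API, which makes it easier to capture numbers for a bug report. A rough sketch, assuming `curl` and `jq` are installed and the server is listening on the default port; the model name and prompt are only placeholders:

```
#!/bin/sh
# Run the same prompt once with the default GPU offload (num_gpu -1 = auto)
# and once forced onto the CPU (num_gpu 0), then print the decode rate.
# eval_duration is reported in nanoseconds by /api/generate.
for n in -1 0; do
  echo "num_gpu=$n"
  curl -s http://127.0.0.1:11434/api/generate \
    -d '{"model":"llama3.1","prompt":"why is the sky blue?","stream":false,"options":{"num_gpu":'"$n"'}}' \
    | jq '{eval_count, eval_rate_tok_s: (.eval_count / (.eval_duration / 1e9))}'
done
```

If the two rates come out nearly identical, the GPU runner almost certainly never started, regardless of what `ollama ps` reports.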
<!-- gh-comment-id:2647770251 --> @rick-github commented on GitHub (Feb 10, 2025): The GPU usage shown by `ollama ps` is calculated before the runner is started. ollama is expecting a GPU enabled runner based on the GPUs detected. A GPU enabled runner is not found so ollama falls back to use CPU. The GPU usage is not updated. This should be exactly the same as running CPU only. If the speed is different, that indicates something else is broken. You would need to provide some logs to allow debugging. A simple check is to run the ollama client in GPU/non-GPU modes: Baseline (want to start GPU but GPU runner may not be available): ```console $ ollama run --verbose llama3.1 >>> why is the sky blue? ... total duration: 5.10967256s load duration: 190.90907ms prompt eval count: 16 token(s) prompt eval duration: 117ms prompt eval rate: 136.75 tokens/s eval count: 386 token(s) eval duration: 4.8s eval rate: 80.42 tokens/s ``` Switch to CPU: ```console >>> /set parameter num_gpu 0 Set parameter 'num_gpu' to '0' >>> why is the sky blue? ... total duration: 26.360083051s load duration: 3.504271067s prompt eval count: 417 token(s) prompt eval duration: 12.149s prompt eval rate: 34.32 tokens/s eval count: 115 token(s) eval duration: 10.243s eval rate: 11.23 tokens/s ```
Author
Owner

@YonTracks commented on GitHub (Feb 10, 2025):

Cheers, I will try to get the data. The speed of the initial model load is also affected; I'm pretty sure that on Windows there is a speed difference between ./ollama serve (dev mode) and ollama serve (installed mode).

I will try and get definitive data.

edit^:
GPU not working:

ollama run --verbose llama3.1
>>> hello
Hello! How are you today? Is there something I can help you with or would you like to chat?

total duration:       18.669819s
load duration:        23.3505ms
prompt eval count:    11 token(s)
prompt eval duration: 825ms
prompt eval rate:     13.33 tokens/s
eval count:           23 token(s)
eval duration:        17.82s
eval rate:            1.29 tokens/s

GPU not working / CPU-only mode:

ollama run --verbose llama3.1
>>> /set parameter num_gpu 0
Set parameter 'num_gpu' to '0'
>>> hello
Hello! How are you today? Is there something I can help you with or would you like to chat?

total duration:       18.7362011s
load duration:        23.3079ms
prompt eval count:    11 token(s)
prompt eval duration: 847ms
prompt eval rate:     12.99 tokens/s
eval count:           23 token(s)
eval duration:        17.865s
eval rate:            1.29 tokens/s

GPU working correctly, but disabled with CUDA_VISIBLE_DEVICES=-1.
CPU only:

ollama run --verbose llama3.1
>>> hello
Hello! How are you today? Is there something I can help you with or would you like to chat?

total duration:       3.40819s
load duration:        23.3603ms
prompt eval count:    11 token(s)
prompt eval duration: 551ms
prompt eval rate:     19.96 tokens/s
eval count:           23 token(s)
eval duration:        2.833s
eval rate:            8.12 tokens/s

GPU working correctly:

ollama run --verbose llama3.1
>>> hello
Hello! How are you today? Is there something I can help you with or would you like to chat?

total duration:       525.3746ms
load duration:        22.7864ms
prompt eval count:    11 token(s)
prompt eval duration: 56ms
prompt eval rate:     196.43 tokens/s
eval count:           23 token(s)
eval duration:        445ms
eval rate:            51.69 tokens/s

GPU working, CPU only via parameter:

ollama run --verbose llama3.1
>>> /set parameter num_gpu 0
Set parameter 'num_gpu' to '0'
>>> hello
Hello! How are you today? Is there something I can help you with or would you like to chat?

total duration:       8.212891s
load duration:        4.889852s
prompt eval count:    11 token(s)
prompt eval duration: 531ms
prompt eval rate:     20.72 tokens/s
eval count:           23 token(s)
eval duration:        2.791s
eval rate:            8.24 tokens/s
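
For reference, hiding the GPU from the server with that variable is usually done before `ollama serve` is started; a PowerShell sketch of how such a test is typically set up (not necessarily exactly what was done here):

```
# Hide all CUDA devices from the server process, then start it for a CPU-only run.
$env:CUDA_VISIBLE_DEVICES = "-1"
ollama serve
# Afterwards, remove the override so the GPU is visible again on the next start.
Remove-Item Env:\CUDA_VISIBLE_DEVICES
```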

edit^: huge update. I also compared ./ollama serve vs ollama serve vs go run . serve, quitting Ollama first each time, and even so I found an orphaned ollama process in Task Manager when nothing should have been running. So I started Ollama from the icon, and now I have two ollama processes. If I run the same prompt now:

ollama run --verbose llama3.1
>>> hello
Hello! How are you today? Is there something I can help you with or would you like to chat?

total duration:       706.5097ms
load duration:        25.3654ms
prompt eval count:    11 token(s)
prompt eval duration: 182ms
prompt eval rate:     60.44 tokens/s
eval count:           23 token(s)
eval duration:        497ms
eval rate:            46.28 tokens/s

Both Ollama and ollama.exe were running.
I'll see whether ollama clears it somehow before restarting the PC or ending the task.
Restarting ollama and the server did not clear it automatically; only ollama.exe clears. Not exactly sure how to reproduce it, I will test more and find out. Epic! Cheers.

edit^: With the server started and server.log streaming in the console, starting it again while it is already running makes server.log (viewed in the editor) show Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
Yet ollama keeps running in the console, and Ollama is orphaned.
Normally, I think (lol), this sorts itself out once that error is shown (it now refuses to start the server if one is already running, so the orphans are hard to reproduce).

Most likely this is related to the previous issue as well, when the GPU is not working correctly; I'll test that also.
edit: note to self (ToDo): with nothing running (no ollama processes at all), ollama ps started the server, but a little differently.

good luck cheers
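
When that bind error shows up, a quick way to see what is still holding the port and which Ollama processes are alive (PowerShell sketch; process names assumed from a default Windows install):

```
# Who is listening on the default Ollama port?
netstat -ano | findstr :11434
# Which Ollama processes (server and tray app) are still running?
Get-Process ollama* | Select-Object Id, ProcessName, Path
# If an orphan is confirmed, it can be stopped explicitly:
# Stop-Process -Name "ollama*" -Force
```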

<!-- gh-comment-id:2647788715 --> @YonTracks commented on GitHub (Feb 10, 2025): cheers, I will try get the data, also the speed of the initial model load also is affected, pretty sure with windows, speed difference between `./ollama serve` dev mode, and `ollama serve` installed mode. I will try and get definitive data. edit^: not working gpu. ``` ollama run --verbose llama3.1 >>> hello Hello! How are you today? Is there something I can help you with or would you like to chat? total duration: 18.669819s load duration: 23.3505ms prompt eval count: 11 token(s) prompt eval duration: 825ms prompt eval rate: 13.33 tokens/s eval count: 23 token(s) eval duration: 17.82s eval rate: 1.29 tokens/s ``` not working gpu / cpu only mode. ``` ollama run --verbose llama3.1 >>> /set parameter num_gpu 0 Set parameter 'num_gpu' to '0' >>> hello Hello! How are you today? Is there something I can help you with or would you like to chat? total duration: 18.7362011s load duration: 23.3079ms prompt eval count: 11 token(s) prompt eval duration: 847ms prompt eval rate: 12.99 tokens/s eval count: 23 token(s) eval duration: 17.865s eval rate: 1.29 tokens/s ``` gpu working correct. disabled with `CUDA_VISIBLE_DEVICES` -1 cpu only: ``` ollama run --verbose llama3.1 >>> hello Hello! How are you today? Is there something I can help you with or would you like to chat? total duration: 3.40819s load duration: 23.3603ms prompt eval count: 11 token(s) prompt eval duration: 551ms prompt eval rate: 19.96 tokens/s eval count: 23 token(s) eval duration: 2.833s eval rate: 8.12 tokens/s ``` gpu working correct: ``` ollama run --verbose llama3.1 >>> hello Hello! How are you today? Is there something I can help you with or would you like to chat? total duration: 525.3746ms load duration: 22.7864ms prompt eval count: 11 token(s) prompt eval duration: 56ms prompt eval rate: 196.43 tokens/s eval count: 23 token(s) eval duration: 445ms eval rate: 51.69 tokens/s ``` gpu working / cpu only via params: ``` ollama run --verbose llama3.1 >>> /set parameter num_gpu 0 Set parameter 'num_gpu' to '0' >>> hello Hello! How are you today? Is there something I can help you with or would you like to chat? total duration: 8.212891s load duration: 4.889852s prompt eval count: 11 token(s) prompt eval duration: 531ms prompt eval rate: 20.72 tokens/s eval count: 23 token(s) eval duration: 2.791s eval rate: 8.24 tokens/s ``` edit^: huge update. I also tried to compare `./ollama serve` vs `ollama serve` and `go run . serve` quiting ollama first, even if not and found orphaned `ollama` in the task manager process. nothing should be running? so I try start ollama with the icon, now I have 2x `ollama`, if I run the same prompt now? ``` ollama run --verbose llama3.1 >>> hello Hello! How are you today? Is there something I can help you with or would you like to chat? total duration: 706.5097ms load duration: 25.3654ms prompt eval count: 11 token(s) prompt eval duration: 182ms prompt eval rate: 60.44 tokens/s eval count: 23 token(s) eval duration: 497ms eval rate: 46.28 tokens/s ``` both `Ollama` and `ollama.exe` . i will see if ollama will clear it somehow, before restarting pc, or ending the task. restarting ollama, and the server, did not auto clear, only `ollama.exe` clears, not exact sure how to reproduce, I will test more and find out. epic! cheers. 
edit^: server started server.log running in the console, and while already running, server.log in editor shows `Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.` but ollama is running in the console. and `Ollama` is orphaned. normally, I think lol, this automatically sorts itself, with that error shown, (it is now not allowing the server if already running, hard to reproduce the orphans). most likely related to previous issue also, when the gpu is not working correct, I'll test that also. edit: for memory note to self< ToDo running ollama ps with nothing running `no ollama running` and ollama ps started the server, but a little differently. good luck cheers
Author
Owner

@YonTracks commented on GitHub (Feb 10, 2025):

I must say again, in case I haven't enough: my current testing is against a compiled 0.5.7... latest, built via CMake (cmake -B build then cmake --build build, configured via the VS 2022 workflow) using the VS 2022 native tools or the VS Code CMake extension (very, very slow), or via powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1 (the PowerShell script also builds OllamaSetup.exe, without the GPU files, but it is a quick full compile).

With this type of build, ./ollama serve, go run . serve, and ./ollama start all work great, with server.log streaming in the console.

With this type of build, it is the app installed via the compiled OllamaSetup.exe, launched from the app icon, that triggers the GPU issues.

If using any of the official OllamaSetup.exe builds, everything works great as long as the environment details are correct; otherwise I see issues similar to the faulty compiled OllamaSetup.exe (for me, the CUDA 12.6 and 12.8 toolkits clash; should I only have one installed?).

If I add the following to ollama.iss, the compiled OllamaSetup.exe is no longer faulty and extracts the GPU files like the official one:

#if DirExists("..\build\lib\ollama")
Source: "..\build\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs
#endif
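
To confirm the installer actually shipped the GPU libraries, listing the install's lib\ollama directory is a quick check; a sketch assuming the default per-user install path used by OllamaSetup.exe:

```
# GPU backends and runners should show up here when the installer packaged them.
Get-ChildItem "$env:LOCALAPPDATA\Programs\Ollama\lib\ollama" -Recurse
```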

edit^: I'm also seeing Continue vs the CLI causing different but similar issues depending on how ollama is used, mostly Continue having issues with ./ollama, or something like that.
good luck, cheers

<!-- gh-comment-id:2648086984 --> @YonTracks commented on GitHub (Feb 10, 2025): I must say again, if I haven't enough, current testing is for compiled `0.5.7... latest`, via CMake `cmake -B build cmake --build build` (configured via vscode or VS 2022 `workflow`) using VS 2022 native tools, or vscode Cmake extension (very very slow) or `powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1` (but the powershell script builds the OllamaSetup.exe also, without the gpu files, but quick full compile). with this type of build style, `./ollama serve` | `go run . serve` | and `./ollama start` all work great, server.log streaming in the console. with this type of build style, the installed app via the compiled OllamaSetup.exe, when running the app icon, is what triggers the gpu issues. if using any of the `official OllamaSetup.exe` everything works great if the env details are correct, else similar issues to the faulty compiled OllamaSetup.exe (for me cuda 12.6 and 12.8 clashing and toolkit/s, I should use only 1?) if I add the following to the ollama.iss, then the compiled no longer faulty OllamaSetup.exe will extract the gpu files like the `official` ``` #if DirExists("..\build\lib\ollama") Source: "..\build\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs #endif ``` edit^: I'm seeing `continue` vs `cli` causing different similar issues also depending on how ollama is used, mostly continue issues `./ollama` or something. good luck, cheers
Author
Owner

@rick-github commented on GitHub (Feb 10, 2025):

If you edit posts, people subscribed to the thread will not be notified of your additions.

The difference in the CPU tests may be down to the different CPU runners. The absolute fallback when ollama fails to find a runner will be a very generic runner, compiled with little optimization to allow it to run anywhere. Allowing ollama to choose the CPU runner by disabling the GPU lets it pick one of the runners with AVX optimisations.
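
On Linux, a quick way to check both sides of that choice, namely which SIMD extensions the CPU reports and which runner variants an install shipped (the runners path is taken from the log excerpt later in this thread and moves around between releases):

```
# CPU instruction-set flags relevant to the runner selection
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u
# Runner variants bundled with a 0.5.x Linux install
ls /usr/local/lib/ollama/runners/
```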

<!-- gh-comment-id:2648422830 --> @rick-github commented on GitHub (Feb 10, 2025): If you edit posts, people subscribed to the thread will not be notified of your additions. The difference in the CPU tests may be down to the different CPU runners. The absolute fallback when ollama fails to find a runner will be a very generic runner, compiled with little optimization to allow it to run anywhere. Allowing ollama to choose the CPU runner by disabling the GPU lets it pick one of the runners with AVX optimisations.
Author
Owner

@arkerwu commented on GitHub (Feb 12, 2025):

Using the official installation package may hit this issue, while the self-compiled version can properly utilize the GPU.
See #8908.

<!-- gh-comment-id:2652437027 --> @arkerwu commented on GitHub (Feb 12, 2025): Using the official installation package may encounter this issue, while the self-compiled version can properly utilize the GPU. [8908](https://github.com/ollama/ollama/issues/8908)
Author
Owner

@Osirising commented on GitHub (Feb 12, 2025):

So how do I solve this? The deepseek-r1:70b I deployed is far too slow; a single command takes 60s. Should I roll back the Ollama version, or what else can I do?

<!-- gh-comment-id:2653215097 --> @Osirising commented on GitHub (Feb 12, 2025): 那怎么解决这个问题呢?我部署的deepseek-r1:70b速度太慢了,一条命令要60s。回退ollama版本?还是怎么办?
Author
Owner

@rick-github commented on GitHub (Feb 12, 2025):

Server logs may aid in debugging.

<!-- gh-comment-id:2653269569 --> @rick-github commented on GitHub (Feb 12, 2025): [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

Not sure whether this hinders or helps, but here are the changes I made to get things working with the latest 0.5.8+ (CUDA driver plus toolkits 12.8 and 12.6 for the latest).

With these changes, the old 0.5.7 (before the repo changes) no longer builds correctly unless I delete 12.8 and revert the path settings. But reverting breaks the latest...

So, for me, CUDA and the Path env variables fix it.
For the latest 0.5.8+ I have CUDA_PATH, CUDA_PATH_V12_6 and CUDA_PATH_V12_8.
For the old 0.5.7 this does not work, and I need to remove the 12.8 entries.

But this applies to compiling only; otherwise the official OllamaSetup.exe install is good.

<!-- gh-comment-id:2655166854 --> @YonTracks commented on GitHub (Feb 13, 2025): not sure if hinder, or help? but for me the changes I made to make it work with the latest 0.5.8+ (cuda and toolkit 12.8 and 12.6 for the latest). with these changes the old, 0.5.7, before the repo changes, now does not build correct unless I delete 12.8 and revert the path settings. But, reverting breaks the latest... so. for me. cuda! and Path env variables, fixes it. for the latest 0.5.8+ I have CUDA_PATH and CUDA_PATH_V12_6 and CUDA_PATH_V12_8 old 0.5.7, this does not work, I need to remove the 12.8. but compiling only, else official install OllamaSetup.exe is good
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

Just to be clear also, sorry.

CUDA_PATH, CUDA_PATH_V12_6 and CUDA_PATH_V12_8 are separate variables from Path.
And in Path there are a few entries:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp

and
C:\Program Files\NVIDIA Corporation\Nsight Compute 2025.1.0\
which I think is gone when reverted.

and
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin
I had never really touched a Windows env before ollama. Epic!
I will try to get more definitive data; it might take a while.
good luck
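
For anyone comparing setups, the same information can be dumped from a PowerShell prompt; just a sketch:

```
# Every CUDA_PATH* variable visible to the current session
Get-ChildItem Env:CUDA_PATH*
# Any CUDA/NVIDIA-related entries on PATH
$env:Path -split ';' | Where-Object { $_ -match 'CUDA|NVIDIA' }
```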

<!-- gh-comment-id:2655179497 --> @YonTracks commented on GitHub (Feb 13, 2025): just to be clear also, srry. `CUDA_PATH` and `CUDA_PATH_V12_6` and `CUDA_PATH_V12_8` separate! to `Path` . and in the Path there are a few. `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin` `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp` and `C:\Program Files\NVIDIA Corporation\Nsight Compute 2025.1.0\` I think is gone when reverted. and `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin` I did not even really ever touch, a windows env before `ollama` . epic! I will try get more definitive data, might take a while. good luck
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

0.5.9 compiles with powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1. Awesome!

It compiled very, very quickly, building only the required files and nothing more:

time=2025-02-13T11:47:33.037+10:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"

For me cmake -B build and cmake --build build is very slow.
But powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1 is very quick.
Loading is fast and everything seems very good. Thank you, ollama, keep up the great work. Love it.

time=2025-02-13T11:57:24.388+10:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[]
only using the ml backend.
0.5.9-yontracks.txt

<!-- gh-comment-id:2655249011 --> @YonTracks commented on GitHub (Feb 13, 2025): compiles 0.5.9 with `powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1` . awesome! compiled very, very quick, only built with the required files and nothing more, ``` time=2025-02-13T11:47:33.037+10:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB" ``` for me `cmake -B build` and `cmake --build build` is very slow. but `powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1` very quick loading, all seems very good. thank you! ollama. Keep up the great work. love it `time=2025-02-13T11:57:24.388+10:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[]` only using the ml backend. [0.5.9-yontracks.txt](https://github.com/user-attachments/files/18776271/0.5.9-yontracks.txt)
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

compiles 0.5.9 with powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1 . awesome!

compiled very, very quick, only built with the required files and nothing more,

time=2025-02-13T11:47:33.037+10:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"

for me cmake -B build and cmake --build build is very slow. but powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1 very quick loading, all seems very good. thank you! ollama. Keep up the great work. love it

0.5.9-yontracks.txt

Hold up... sorry, I forgot about the .iss:
I had modified it.

With the original .iss? Not working...

<!-- gh-comment-id:2655275202 --> @YonTracks commented on GitHub (Feb 13, 2025): > compiles 0.5.9 with `powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1` . awesome! > > compiled very, very quick, only built with the required files and nothing more, > > ``` > time=2025-02-13T11:47:33.037+10:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB" > ``` > > for me `cmake -B build` and `cmake --build build` is very slow. but `powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1` very quick loading, all seems very good. thank you! ollama. Keep up the great work. love it > > [0.5.9-yontracks.txt](https://github.com/user-attachments/files/18776271/0.5.9-yontracks.txt) hold up... srry, I forgot about the .iss I modified it? with the original iss? not working...
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

compiles 0.5.9 with powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1 . awesome!
compiled very, very quick, only built with the required files and nothing more,

time=2025-02-13T11:47:33.037+10:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"

for me cmake -B build and cmake --build build is very slow. but powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1 very quick loading, all seems very good. thank you! ollama. Keep up the great work. love it
0.5.9-yontracks.txt

Hold up... sorry, I forgot about the .iss. I had modified it.

With the original .iss? Not working...

modifications:

#if DirExists("..\build\lib\ollama")
Source: "..\build\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs
#endif

time=2025-02-13T12:11:42.690+10:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[]
0.5.9-original.txt
but the ml backend is not working.

edit^: Just to be clear!
ollama ps says 100% GPU, lol... but it is slower than with CPU only!!! <<<

<!-- gh-comment-id:2655278139 --> @YonTracks commented on GitHub (Feb 13, 2025): > > compiles 0.5.9 with `powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1` . awesome! > > compiled very, very quick, only built with the required files and nothing more, > > ``` > > time=2025-02-13T11:47:33.037+10:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB" > > ``` > > > > > > > > > > > > > > > > > > > > > > > > for me `cmake -B build` and `cmake --build build` is very slow. but `powershell -ExecutionPolicy Bypass -File .\scripts\build_windows.ps1` very quick loading, all seems very good. thank you! ollama. Keep up the great work. love it > > [0.5.9-yontracks.txt](https://github.com/user-attachments/files/18776271/0.5.9-yontracks.txt) > > hold up... srry, I forgot about the .iss I modified it? > > with the original iss? not working... modifications: ``` #if DirExists("..\build\lib\ollama") Source: "..\build\lib\ollama\*"; DestDir: "{app}\lib\ollama\"; Check: not IsArm64(); Flags: ignoreversion 64bit recursesubdirs #endif ``` `time=2025-02-13T12:11:42.690+10:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[]` [0.5.9-original.txt](https://github.com/user-attachments/files/18776380/0.5.9-original.txt) but the ml backend is not working. edit^: Just to be clear! `ollama ps` says 100% gpu lol... slower than with cpu only!!! <<<
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

I will test with only CUDA 12.8, and vice versa. This will take a while.

<!-- gh-comment-id:2655285966 --> @YonTracks commented on GitHub (Feb 13, 2025): I will test, with only cuda 12_8 and vice versa. this will take a while.
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

very good info!!! I'm only a mechanic lol

CUDA 12.6: I uninstalled 12.8 completely,
so it is only 12.6, and the env has only CUDA_PATH_V12_6,
and Path has only the 12.6 entries:
check the log.

Now 0.5.7 (before the repo changes) will compile and run great.
0.5.9 will compile with PowerShell, but now, even with the .iss build mod, it does not work.

I know, but can't explain it... lol

0.5.9-original-12.6.txt

<!-- gh-comment-id:2655492880 --> @YonTracks commented on GitHub (Feb 13, 2025): very good info!!! I'm only a mechanic lol cuda 12.6. I uninstalled 12.8 completely. so only 12.6. and env has only ` CUDA_PATH_V12_6` and Path has only: check the log. now 0.5.7 before the changes will compile and run great. 0.5.9 will compile with powershell, but now even with the .iss build mod, it does not work. I know, but can't explain it... lol [0.5.9-original-12.6.txt](https://github.com/user-attachments/files/18777753/0.5.9-original-12.6.txt)
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

I also see ollama fetching both the user env variables and the system variables.
edit^: I know these are being skipped, but these are user variables, not system ones.

time=2025-02-13T15:04:05.288+10:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=c:\msys64\ucrt64\bin
time=2025-02-13T15:04:05.288+10:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=c:\msys64\usr\bin

Scary. Now I'll try uninstalling 12.6 and running only 12.8.

good luck

<!-- gh-comment-id:2655509490 --> @YonTracks commented on GitHub (Feb 13, 2025): I also see ollama fetching both user env variables and the system variables. edit^: I know is being skipped, but this is user variables and not sytem. ``` time=2025-02-13T15:04:05.288+10:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=c:\msys64\ucrt64\bin time=2025-02-13T15:04:05.288+10:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=c:\msys64\usr\bin ``` scary, now, I try uninstall 12.6? and only run 12.8. good luck
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

cuda 12.6. I uninstalled 12.8 completely. so only 12.6. and env has only CUDA_PATH_V12_6 and Path has only: check the log.

now 0.5.7 before the changes will compile and run great. 0.5.9 will compile with powershell, but now even with the .iss build mod, it does not work.

pulled the latest changes, and with

0.5.9-build-mod.txt

12.6 only with the build mod, it does still work.

<!-- gh-comment-id:2655538713 --> @YonTracks commented on GitHub (Feb 13, 2025): > cuda 12.6. I uninstalled 12.8 completely. so only 12.6. and env has only ` CUDA_PATH_V12_6` and Path has only: check the log. > > now 0.5.7 before the changes will compile and run great. 0.5.9 will compile with powershell, but now even with the .iss build mod, it does not work. > pulled the latest changes, and with [0.5.9-build-mod.txt](https://github.com/user-attachments/files/18778512/0.5.9-build-mod.txt) 12.6 only with the build mod, it does still work.
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

I also see ollama fetching both the user env variables and the system variables. edit^: I know these are being skipped, but these are user variables, not system ones.

time=2025-02-13T15:04:05.288+10:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=c:\msys64\ucrt64\bin
time=2025-02-13T15:04:05.288+10:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=c:\msys64\usr\bin

Scary. Now I'll try uninstalling 12.6 and running only 12.8.

good luck

12.8 only: with CUDA_PATH_V12_8 only and the path has:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp
C:\Program Files\NVIDIA Corporation\Nsight Compute 2025.1.0\

0.5.9-original-12.8only-compiled.txt
not working...

<!-- gh-comment-id:2655562887 --> @YonTracks commented on GitHub (Feb 13, 2025): > I also see ollama fetching both user env variables and the system variables. edit^: I know is being skipped, but this is user variables and not sytem. > > ``` > time=2025-02-13T15:04:05.288+10:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=c:\msys64\ucrt64\bin > time=2025-02-13T15:04:05.288+10:00 level=DEBUG source=ggml.go:83 msg="skipping path which is not part of ollama" path=c:\msys64\usr\bin > ``` > > scary, now, I try uninstall 12.6? and only run 12.8. > > good luck 12.8 only: with ` CUDA_PATH_V12_8` only and the `path` has: `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin` `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\libnvvp` `C:\Program Files\NVIDIA Corporation\Nsight Compute 2025.1.0\` [0.5.9-original-12.8only-compiled.txt](https://github.com/user-attachments/files/18778748/0.5.9-original-12.8only-compiled.txt) not working...
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

12.8 only, with the .iss build mod:
working.

0.5.9-original12.8-buildmod-compiled.txt

Now I'll try the official OllamaSetup.exe with the huge (lol) wizard size.

I see the official OllamaSetup.exe extracting the GPU files for everything, as usual.

0.5.9-official-huge-wizard-size-lol.txt

<!-- gh-comment-id:2655593693 --> @YonTracks commented on GitHub (Feb 13, 2025): 12.8 only with the .iss build mod working. [0.5.9-original12.8-buildmod-compiled.txt](https://github.com/user-attachments/files/18778801/0.5.9-original12.8-buildmod-compiled.txt) now I try the official OllamaSetup.exe with the `huge` lol wizard size. lol . I see the official OllamaSetup.exe extracting the gpu files for all, as usual. [0.5.9-official-huge-wizard-size-lol.txt](https://github.com/user-attachments/files/18778881/0.5.9-official-huge-wizard-size-lol.txt)
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

12.8 only, on an older repo (0.5.7, before the changes), and now I can't build it???
But I am working on that 0.5.7 repo..

So what do I do? Folks will be in the same boat??? Do I reinstall 12.6???
I think the best option is to integrate the updated changes so it runs with 12.8.

good luck.

<!-- gh-comment-id:2655642068 --> @YonTracks commented on GitHub (Feb 13, 2025): 12.8 only, and on an older repo 0.5.7 before the changes, and now I can't build it??? but I am working on that repo 0.5.7.. so what I do/ folks will be in same boat??? what I do is reinstall 12.6??? but best I think, is to integrate the updated changes to run 12.8. good luck.
Author
Owner

@foloumi commented on GitHub (Feb 13, 2025):

Same issue here. It's not just Deepseek, but any model above 20GB in size. I have had the same happen with Qwen2.5-Coder:32b and Phind-CodeLlama:34b. It seems like some VRAM mismanagement is happening that makes Ollama think there isn't enough VRAM, so it falls back to CPU inference. The same models run fine on an older version of Ollama. The main question is which version introduced the issue.

<!-- gh-comment-id:2656387979 --> @foloumi commented on GitHub (Feb 13, 2025): Same issue here. It's not just Deepseek, but any model above 20GB in size. I have had the same happen with Qwen2.5-Coder:32b and Phind-CodeLlama:34b. Seems like there is some VRAM mismanagement happening that makes Ollama think there isn't enough VRAM and falls back to CPU inference. Same models run fine on an older version of Ollama. Main question is, which version has introduced the issue.
Author
Owner

@rick-github commented on GitHub (Feb 13, 2025):

Server logs may aid in debugging.

<!-- gh-comment-id:2656392009 --> @rick-github commented on GitHub (Feb 13, 2025): [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
Author
Owner

@foloumi commented on GitHub (Feb 13, 2025):

Server logs may aid in debugging.

Attached.

Ollama_Service_Logs.txt

<!-- gh-comment-id:2656413309 --> @foloumi commented on GitHub (Feb 13, 2025): > [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging. Attached. [Ollama_Service_Logs.txt](https://github.com/user-attachments/files/18783538/Ollama_Service_Logs.txt)
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

Server logs may aid in debugging.

Attached.

Ollama_Service_Logs.txt

12.4, interesting. Should they try 12.8? Is there a 12.8 available for them? What version?
good luck

<!-- gh-comment-id:2656431265 --> @YonTracks commented on GitHub (Feb 13, 2025): > > [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging. > > Attached. > > [Ollama_Service_Logs.txt](https://github.com/user-attachments/files/18783538/Ollama_Service_Logs.txt) 12.4, interesting, should they try 12.8? is there a 12.8 for them? what version? good luck
Author
Owner

@rick-github commented on GitHub (Feb 13, 2025):

This looks normal.

Feb 13 11:11:22 NeuralNexus ollama[216839]: time=2025-02-13T11:11:22.234+01:00 level=INFO source=memory.go:356
  msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=24 layers.split=12,12
  memory.available="[23.4 GiB 23.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="77.5 GiB"
  memory.required.partial="45.7 GiB" memory.required.kv="31.2 GiB" memory.required.allocations="[22.9 GiB 22.8 GiB]" 
  memory.weights.total="48.7 GiB" memory.weights.repeating="48.1 GiB" memory.weights.nonrepeating="609.1 MiB"
  memory.graph.full="12.5 GiB" memory.graph.partial="12.5 GiB"

Two cards with 23.4G and 23.2G available respectively. Ollama allocates 22.9 and 22.8G and offloads 24 layers. Large context size results in large KV cache of 31G.
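
For reference, that figure follows directly from the model metadata earlier in this thread (the blob hash matches the 32B model with n_layer = 64 and n_embd_k_gqa = n_embd_v_gqa = 1024) combined with the 128000-token context and the f16 cache; a quick check of the arithmetic:

```
# layers × ctx × (K + V embedding per layer) × 2 bytes for f16
$ python3 -c 'print(64 * 128000 * (1024 + 1024) * 2 / 2**20, "MiB")'
32000.0 MiB
```

which matches the "KV self size = 32000.00 MiB" line below.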

Feb 13 11:11:22 NeuralNexus ollama[216839]: time=2025-02-13T11:11:22.235+01:00 level=INFO source=server.go:376 
 msg="starting llama server"
  cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /media/SSD_Storage/ollama/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93
  --ctx-size 128000 --batch-size 512 --n-gpu-layers 24 --verbose --threads 16 --parallel 1
  --tensor-split 12,12 --port 36069"

CUDA enabled runner is started, 12 layers per card.

Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init:        CPU KV buffer size = 20000.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init:      CUDA0 KV buffer size =  6000.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init:      CUDA1 KV buffer size =  6000.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: KV self size  = 32000.00 MiB, K (f16): 16000.00 MiB, V (f16): 16000.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model:        CPU  output buffer size =     0.60 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model:      CUDA0 compute buffer size = 10780.02 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model:      CUDA1 compute buffer size = 10290.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model:  CUDA_Host compute buffer size =   260.01 MiB

llama.cpp loads model into both GPUs.

From the model loading point of view, everything looks OK. How are you quantifying a performance drop?

<!-- gh-comment-id:2656442272 --> @rick-github commented on GitHub (Feb 13, 2025): This looks normal. ``` Feb 13 11:11:22 NeuralNexus ollama[216839]: time=2025-02-13T11:11:22.234+01:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=24 layers.split=12,12 memory.available="[23.4 GiB 23.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="77.5 GiB" memory.required.partial="45.7 GiB" memory.required.kv="31.2 GiB" memory.required.allocations="[22.9 GiB 22.8 GiB]" memory.weights.total="48.7 GiB" memory.weights.repeating="48.1 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="12.5 GiB" memory.graph.partial="12.5 GiB" ``` Two cards with 23.4G and 23.2G available respectively. Ollama allocates 22.9 and 22.8G and offloads 24 layers. Large context size results in large KV cache of 31G. ``` Feb 13 11:11:22 NeuralNexus ollama[216839]: time=2025-02-13T11:11:22.235+01:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /media/SSD_Storage/ollama/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 --ctx-size 128000 --batch-size 512 --n-gpu-layers 24 --verbose --threads 16 --parallel 1 --tensor-split 12,12 --port 36069" ``` CUDA enabled runner is started, 12 layers per card. ``` Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init: CPU KV buffer size = 20000.00 MiB Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init: CUDA0 KV buffer size = 6000.00 MiB Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init: CUDA1 KV buffer size = 6000.00 MiB Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: KV self size = 32000.00 MiB, K (f16): 16000.00 MiB, V (f16): 16000.00 MiB Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: CPU output buffer size = 0.60 MiB Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: CUDA0 compute buffer size = 10780.02 MiB Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: CUDA1 compute buffer size = 10290.00 MiB Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: CUDA_Host compute buffer size = 260.01 MiB ``` llama.cpp loads model into both GPUs. From the model loading point of view, everything looks OK. How are you quantifying a performance drop?
Author
Owner

@foloumi commented on GitHub (Feb 13, 2025):

This looks normal.

Feb 13 11:11:22 NeuralNexus ollama[216839]: time=2025-02-13T11:11:22.234+01:00 level=INFO source=memory.go:356
  msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=24 layers.split=12,12
  memory.available="[23.4 GiB 23.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="77.5 GiB"
  memory.required.partial="45.7 GiB" memory.required.kv="31.2 GiB" memory.required.allocations="[22.9 GiB 22.8 GiB]" 
  memory.weights.total="48.7 GiB" memory.weights.repeating="48.1 GiB" memory.weights.nonrepeating="609.1 MiB"
  memory.graph.full="12.5 GiB" memory.graph.partial="12.5 GiB"

Two cards with 23.4G and 23.2G available respectively. Ollama allocates 22.9 and 22.8G and offloads 24 layers. Large context size results in large KV cache of 31G.

Feb 13 11:11:22 NeuralNexus ollama[216839]: time=2025-02-13T11:11:22.235+01:00 level=INFO source=server.go:376 
 msg="starting llama server"
  cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /media/SSD_Storage/ollama/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93
  --ctx-size 128000 --batch-size 512 --n-gpu-layers 24 --verbose --threads 16 --parallel 1
  --tensor-split 12,12 --port 36069"

CUDA enabled runner is started, 12 layers per card.

Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init:        CPU KV buffer size = 20000.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init:      CUDA0 KV buffer size =  6000.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init:      CUDA1 KV buffer size =  6000.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: KV self size  = 32000.00 MiB, K (f16): 16000.00 MiB, V (f16): 16000.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model:        CPU  output buffer size =     0.60 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model:      CUDA0 compute buffer size = 10780.02 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model:      CUDA1 compute buffer size = 10290.00 MiB
Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model:  CUDA_Host compute buffer size =   260.01 MiB

llama.cpp loads model into both GPUs.

From the model loading point of view, everything looks OK. How are you quantifying a performance drop?

Great point about the context window! Let me adjust that first before testing again. Previously, though, the performance drop was quite obvious, based on both the resource manager showing only CPU utilization and a very slow token generation rate. I'll post again after testing with a smaller context window.
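
For anyone else hitting this, a couple of ways to pass a smaller context (8192 here is only an example value, and the model tag is a placeholder):

```
# In the interactive client:
>>> /set parameter num_ctx 8192

# Or per request via the API:
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "deepseek-r1:70b",
  "prompt": "hello",
  "options": { "num_ctx": 8192 }
}'
```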

<!-- gh-comment-id:2656456477 --> @foloumi commented on GitHub (Feb 13, 2025): > This looks normal. > > ``` > Feb 13 11:11:22 NeuralNexus ollama[216839]: time=2025-02-13T11:11:22.234+01:00 level=INFO source=memory.go:356 > msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=24 layers.split=12,12 > memory.available="[23.4 GiB 23.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="77.5 GiB" > memory.required.partial="45.7 GiB" memory.required.kv="31.2 GiB" memory.required.allocations="[22.9 GiB 22.8 GiB]" > memory.weights.total="48.7 GiB" memory.weights.repeating="48.1 GiB" memory.weights.nonrepeating="609.1 MiB" > memory.graph.full="12.5 GiB" memory.graph.partial="12.5 GiB" > ``` > > Two cards with 23.4G and 23.2G available respectively. Ollama allocates 22.9 and 22.8G and offloads 24 layers. Large context size results in large KV cache of 31G. > > ``` > Feb 13 11:11:22 NeuralNexus ollama[216839]: time=2025-02-13T11:11:22.235+01:00 level=INFO source=server.go:376 > msg="starting llama server" > cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /media/SSD_Storage/ollama/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 > --ctx-size 128000 --batch-size 512 --n-gpu-layers 24 --verbose --threads 16 --parallel 1 > --tensor-split 12,12 --port 36069" > ``` > > CUDA enabled runner is started, 12 layers per card. > > ``` > Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init: CPU KV buffer size = 20000.00 MiB > Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init: CUDA0 KV buffer size = 6000.00 MiB > Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_kv_cache_init: CUDA1 KV buffer size = 6000.00 MiB > Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: KV self size = 32000.00 MiB, K (f16): 16000.00 MiB, V (f16): 16000.00 MiB > Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: CPU output buffer size = 0.60 MiB > Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: CUDA0 compute buffer size = 10780.02 MiB > Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: CUDA1 compute buffer size = 10290.00 MiB > Feb 13 11:11:35 NeuralNexus ollama[216839]: llama_new_context_with_model: CUDA_Host compute buffer size = 260.01 MiB > ``` > > llama.cpp loads model into both GPUs. > > From the model loading point of view, everything looks OK. How are you quantifying a performance drop? Great point about the context window! Let me adjust that first before testing again. But previously the performance drop is quite obvious based on both resource manager showing only CPU utilization and also a very slow token generation rate. I'll post again after testing with a smaller context window
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

Is that like 2 minutes to complete? Is that good for that size?

<!-- gh-comment-id:2656465627 --> @YonTracks commented on GitHub (Feb 13, 2025): is that like 2min? to complete, is that good for that size.
Author
Owner

@foloumi commented on GitHub (Feb 13, 2025):

is that like 2min? to complete, is that good for that size.

based on previous experience on a similar system, the CPU only inference speed is just not practical at all! It's going at 1 token per second with 32 cores and 128GB of RAM, so it's a no go for actual usage.

<!-- gh-comment-id:2656470744 --> @foloumi commented on GitHub (Feb 13, 2025): > is that like 2min? to complete, is that good for that size. based on previous experience on a similar system, the CPU only inference speed is just not practical at all! It's going at 1 token per second with 32 cores and 128GB of RAM, so it's a no go for actual usage.
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

is that like 2min? to complete, is that good for that size.

based on previous experience on a similar system, the CPU only inference speed is just not practical at all! It's practically going at 1 token per second with 32 cores and 128GB of RAM, so it's a no go for actual usage.

Yep. So is it, or isn't it, doing that? Running very slowly, like CPU only?
Or is it just not as good as before, but still OK?

<!-- gh-comment-id:2656481549 --> @YonTracks commented on GitHub (Feb 13, 2025): > > is that like 2min? to complete, is that good for that size. > > based on previous experience on a similar system, the CPU only inference speed is just no practical at all! It's practically going at 1 token per second with 32 cores and 128GB of RAM, so it's a no go for actual usage. yep, so it is, or isn't doing that? running very slow like cpu only? or is it just not as good as before, but still ok.
Author
Owner

@foloumi commented on GitHub (Feb 13, 2025):

is that like 2min? to complete, is that good for that size.

based on previous experience on a similar system, the CPU only inference speed is just not practical at all! It's practically going at 1 token per second with 32 cores and 128GB of RAM, so it's a no go for actual usage.

yep, so it is, or isn't doing that? running very slow like cpu only? or is it just not as good as before, but still ok.

No, it's certainly only doing CPU inference, not GPU.

<!-- gh-comment-id:2656483776 --> @foloumi commented on GitHub (Feb 13, 2025): > > > is that like 2min? to complete, is that good for that size. > > > > > > based on previous experience on a similar system, the CPU only inference speed is just no practical at all! It's practically going at 1 token per second with 32 cores and 128GB of RAM, so it's a no go for actual usage. > > yep, so it is, or isn't doing that? running very slow like cpu only? or is it just not as good as before, but still ok. No, it's certainly only doing CPU inference, not GPU.
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

Yes, you will likely see that the issue is the GPU files and/or runners in lib/ollama. Maybe transfer the files manually?
good luck

<!-- gh-comment-id:2656486497 --> @YonTracks commented on GitHub (Feb 13, 2025): yes, you will see the issue is the gpu files and or runners? ollama/lib? transfer the files manually? good luck
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

Sorry, dyslexia lol: I meant ..\build\lib\ollama and lib/ollama, lol.

<!-- gh-comment-id:2656498175 --> @YonTracks commented on GitHub (Feb 13, 2025): srry dyslexia lol `..\build\lib\ollama` and lib/ollama lol
Author
Owner

@YonTracks commented on GitHub (Feb 13, 2025):

And if there are still issues, but it works when using the older 0.5.6, then you need CUDA 12.8.
good luck

<!-- gh-comment-id:2656501268 --> @YonTracks commented on GitHub (Feb 13, 2025): and if still issues, but works if using older 0.5.6. than you need cuda 12.8 good luck
Author
Owner

@foloumi commented on GitHub (Feb 13, 2025):

My issue seems to have been related to the large context window that was being passed to Ollama... reducing it keeps the inference on the GPU. Thanks to @rick-github for pointing this out. I was even able to run inference for a 70B Deepseek via model splitting.
I'm running:

  • Ollama: v0.5.7
  • CUDA: 12.4
  • Nvidia Driver: 550
  • Ubuntu: 24.04
<!-- gh-comment-id:2656563053 --> @foloumi commented on GitHub (Feb 13, 2025): My issue seems to have been related to the large context window that was being passed to Ollama... reducing it keeps the inference on the GPU. Thanks to @rick-github for pointing this out. Was even able to do an inference for a 70B Deepseek via model splitting I'm running - Ollama: v0.5.7 - CUDA: 12.4 - Nvidia Driver: 550 - Ubuntu: 24.04

Reference: github-starred/ollama#5814