[GH-ISSUE #13423] Ollama doesn't run on GPU and it runs on CPU #55377

Open
opened 2026-04-29 09:04:15 -05:00 by GiteaMirror · 16 comments

Originally created by @amekiri13 on GitHub (Dec 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13423

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

When I try using Ollama with the qwen3:30b model, I see no GPU usage and ollama does not appear in the nvidia-smi process list. However, Ollama has high CPU usage.
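
A quick way to check where a loaded model actually ended up, assuming it is still resident in the server, is ollama ps, which reports the per-model CPU/GPU split:

# Sketch: with the model still loaded, "ollama ps" lists loaded models
# together with a PROCESSOR column showing how much of each model
# sits on CPU versus GPU.
ollama ps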

OS: Arch Linux Rolling

Kernel:

➜  ~ uname -a
Linux amekiri 6.17.9-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Mon, 24 Nov 2025 15:21:16 +0000 x86_64 GNU/Linux

ollama --version:

➜  ~ ollama --version
ollama version is 0.13.1
Warning: client version is 0.13.2

nvidia-smi:

➜  ~ nvidia-smi
Thu Dec 11 16:13:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   51C    P3             45W /  300W |    2837MiB /  16303MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             963      G   /usr/lib/Xorg                            23MiB |
|    0   N/A  N/A            4022      G   /usr/bin/ksecretd                         3MiB |
|    0   N/A  N/A            4194      G   /usr/bin/kwin_wayland                   142MiB |
|    0   N/A  N/A            4559      G   /usr/bin/Xwayland                         9MiB |
|    0   N/A  N/A            4599      G   /usr/bin/ksmserver                        3MiB |
|    0   N/A  N/A            4601      G   /usr/bin/kded6                            3MiB |
|    0   N/A  N/A            4682      G   /usr/bin/plasmashell                    274MiB |
|    0   N/A  N/A            4739      G   /usr/bin/krdpserver                       3MiB |
|    0   N/A  N/A            4742      G   /usr/bin/kaccess                          3MiB |
|    0   N/A  N/A            4743      G   ...it-kde-authentication-agent-1          3MiB |
|    0   N/A  N/A            5033      G   /usr/bin/kwalletd6                        3MiB |
|    0   N/A  N/A            5448      G   /usr/bin/kdeconnectd                      3MiB |
|    0   N/A  N/A            6579      G   /usr/bin/konsole                          3MiB |
|    0   N/A  N/A            6837      G   /usr/bin/xwaylandvideobridge              3MiB |
|    0   N/A  N/A            6846      G   /usr/lib/DiscoverNotifier                 3MiB |
|    0   N/A  N/A            6863      G   /usr/lib/thunderbird/thunderbird        119MiB |
|    0   N/A  N/A            6888      G   /usr/lib/xdg-desktop-portal-kde           3MiB |
|    0   N/A  N/A            6929      G   ...-wayland-text-input-version=3          3MiB |
|    0   N/A  N/A            6934      G   /usr/bin/dolphin                          3MiB |
|    0   N/A  N/A            7803      G   ...rack-uuid=3190708988185955192        425MiB |
|    0   N/A  N/A            7971      G   /opt/visual-studio-code/code            162MiB |
|    0   N/A  N/A           24508      G   ...-wayland-text-input-version=3          3MiB |
|    0   N/A  N/A           24679      G   ...rack-uuid=3190708988185955192        197MiB |
|    0   N/A  N/A          499754      G   ...share/Steam/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A          500159      G   ./steamwebhelper                         60MiB |
|    0   N/A  N/A          500182      G   ...am/ubuntu12_64/steamwebhelper        377MiB |
|    0   N/A  N/A          919209      G   /usr/lib/baloorunner                      3MiB |
|    0   N/A  N/A          919698      G   ...nt_cherryZiFZpd/Cherry Studio        138MiB |
|    0   N/A  N/A          982624      G   /usr/bin/spectacle                       48MiB |
+-----------------------------------------------------------------------------------------+
Image: https://github.com/user-attachments/assets/e9c7611f-3974-40ca-ab16-445b82aea6b7

Relevant log output

Dec 11 16:05:40 amekiri ollama[1621]: time=2025-12-11T16:05:40.261+08:00 level=INFO source=ggml.go:494 msg="offloaded 0/49 layers to GPU"
Dec 11 16:05:40 amekiri ollama[1621]: time=2025-12-11T16:05:40.261+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="17.3 GiB"
Dec 11 16:05:40 amekiri ollama[1621]: time=2025-12-11T16:05:40.261+08:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="384.0 MiB"
Dec 11 16:05:40 amekiri ollama[1621]: time=2025-12-11T16:05:40.261+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="88.0 MiB"
Dec 11 16:05:40 amekiri ollama[1621]: time=2025-12-11T16:05:40.261+08:00 level=INFO source=device.go:272 msg="total memory" size="17.7 GiB"
Dec 11 16:05:40 amekiri ollama[1621]: time=2025-12-11T16:05:40.261+08:00 level=INFO source=sched.go:517 msg="loaded runners" count=1
Dec 11 16:05:40 amekiri ollama[1621]: time=2025-12-11T16:05:40.261+08:00 level=INFO source=server.go:1294 msg="waiting for llama runner to start responding"
Dec 11 16:05:40 amekiri ollama[1621]: time=2025-12-11T16:05:40.261+08:00 level=INFO source=server.go:1328 msg="waiting for server to become available" status="llm server loading model"
Dec 11 16:05:44 amekiri ollama[1621]: time=2025-12-11T16:05:44.529+08:00 level=INFO source=server.go:1332 msg="llama runner started in 4.41 seconds"
Dec 11 16:09:26 amekiri ollama[1621]: [GIN] 2025/12/11 - 16:09:26 | 200 |       47.97µs |       127.0.0.1 | GET      "/api/version"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.13.1

GiteaMirror added the bug label 2026-04-29 09:04:15 -05:00

@rick-github commented on GitHub (Dec 11, 2025):

Did you install the ollama-cuda package?


@amekiri13 commented on GitHub (Dec 11, 2025):

Sure, I installed ollama-cuda.

➜  ~ yay -Qs ollama     
local/ollama 0.13.2-1
    Create, run and share large language models (LLMs)
local/ollama-cuda 0.13.2-1
    Create, run and share large language models (LLMs) with CUDA
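
The CUDA backend library itself can be confirmed on disk; a rough check, assuming the Arch packaging installs the ggml backends under /usr/lib/ollama:

# Sketch: list the files owned by ollama-cuda and look for the CUDA
# backend shared library that the server loads at runtime.
pacman -Ql ollama-cuda | grep libggml-cuda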

> Did you install the ollama-cuda package?


@rick-github commented on GitHub (Dec 11, 2025):

Set OLLAMA_DEBUG=2 in the server environment and post the log from the start to the model load.
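
For a systemd-managed install this would look roughly like the following, assuming the service unit is named ollama.service:

# Sketch: add the debug variable via a systemd drop-in, restart the
# service, then follow the journal from startup through the model load.
sudo systemctl edit ollama
#   (in the editor, add:)
#   [Service]
#   Environment="OLLAMA_DEBUG=2"
sudo systemctl restart ollama
journalctl -u ollama -f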


@amekiri13 commented on GitHub (Dec 11, 2025):

Here are the Ollama logs with OLLAMA_DEBUG=2 set:

time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=106.556373ms OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] extra_envs=map[]
time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=runner.go:124 msg="evaluating which, if any, devices to filter out" initial_count=2
time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=/usr/lib/ollama description="NVIDIA GeForce RTX 5070 Ti" compute=12.0 id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 pci_id=0000:01:00.0
time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=/usr/lib/ollama description="AMD Ryzen 7 9700X 8-Core Processor" compute=gfx1036 id=0 pci_id=0000:0c:00.0
time=2025-12-11T16:44:16.294+08:00 level=TRACE source=runner.go:440 msg="starting runner for device discovery" libDirs=[/usr/lib/ollama] extraEnvs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]"
time=2025-12-11T16:44:16.294+08:00 level=TRACE source=runner.go:440 msg="starting runner for device discovery" libDirs=[/usr/lib/ollama] extraEnvs="map[CUDA_VISIBLE_DEVICES:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 GGML_CUDA_INIT:1]"
time=2025-12-11T16:44:16.294+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 39761"
time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=server.go:393 msg=subprocess CUDA_PATH=/opt/cuda PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/lib/rustup/bin ROCM_PATH=/opt/rocm OLLAMA_DEBUG=2 LD_LIBRARY_PATH=/usr/lib/ollama OLLAMA_LIBRARY_PATH=/usr/lib/ollama CUDA_VISIBLE_DEVICES=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 GGML_CUDA_INIT=1
time=2025-12-11T16:44:16.294+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 45061"
time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=server.go:393 msg=subprocess CUDA_PATH=/opt/cuda PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/lib/rustup/bin ROCM_PATH=/opt/rocm OLLAMA_DEBUG=2 LD_LIBRARY_PATH=/usr/lib/ollama OLLAMA_LIBRARY_PATH=/usr/lib/ollama ROCR_VISIBLE_DEVICES=0 GGML_CUDA_INIT=1
time=2025-12-11T16:44:16.300+08:00 level=INFO source=runner.go:1398 msg="starting ollama engine"
time=2025-12-11T16:44:16.300+08:00 level=INFO source=runner.go:1433 msg="Server listening on 127.0.0.1:45061"
time=2025-12-11T16:44:16.300+08:00 level=INFO source=runner.go:1398 msg="starting ollama engine"
time=2025-12-11T16:44:16.300+08:00 level=INFO source=runner.go:1433 msg="Server listening on 127.0.0.1:39761"
time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=gguf.go:589 msg=general.architecture type=string
time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=gguf.go:589 msg=tokenizer.ggml.model type=string
time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=gguf.go:589 msg=general.architecture type=string
time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=gguf.go:589 msg=tokenizer.ggml.model type=string
time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.file_type default=0
time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.file_type default=0
time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.name default=""
time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.description default=""
time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.name default=""
time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.description default=""
time=2025-12-11T16:44:16.306+08:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3
time=2025-12-11T16:44:16.306+08:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3
time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected
load_backend: loaded ROCm backend from /usr/lib/ollama/libggml-hip.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-12-11T16:44:16.366+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,880,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.block_count default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.pooling_type default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.expert_count default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}"
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}"
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}"
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}"
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.pre default=""
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.block_count default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.embedding_length default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.head_count default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.head_count_kv default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.key_length default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.dimension_count default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.freq_base default=100000
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.scaling.factor default=1
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=runner.go:1373 msg="dummy model load took" duration=60.432184ms
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
ggml_cuda_init: initializing rocBLAS on device 0

rocBLAS error: Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1036
 List of available TensileLibrary Files : 
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1012.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1103.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1150.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1151.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1200.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx950.dat"
ggml_backend_cuda_device_get_memory device GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 utilizing NVML memory reporting free: 13942194176 total: 17094934528
time=2025-12-11T16:44:16.375+08:00 level=DEBUG source=runner.go:1378 msg="gathering device infos took" duration=9.476959ms
time=2025-12-11T16:44:16.376+08:00 level=TRACE source=runner.go:467 msg="runner enumerated devices" OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] devices="[{DeviceID:{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA} Name:CUDA0 Description:NVIDIA GeForce RTX 5070 Ti FilterID: Integrated:false PCIID:0000:01:00.0 TotalMemory:17094934528 FreeMemory:13942194176 ComputeMajor:12 ComputeMinor:0 DriverMajor:13 DriverMinor:0 LibraryPath:[/usr/lib/ollama]}]"
time=2025-12-11T16:44:16.376+08:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=81.9498ms OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] extra_envs="map[CUDA_VISIBLE_DEVICES:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 GGML_CUDA_INIT:1]"
time=2025-12-11T16:44:17.275+08:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" error="runner crashed"
time=2025-12-11T16:44:17.275+08:00 level=TRACE source=runner.go:467 msg="runner enumerated devices" OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] devices=[]
time=2025-12-11T16:44:17.275+08:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=981.006794ms OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]"
time=2025-12-11T16:44:17.275+08:00 level=DEBUG source=runner.go:153 msg="filtering device which didn't fully initialize" id=0 libdir=/usr/lib/ollama pci_id=0000:0c:00.0 library=ROCm
time=2025-12-11T16:44:17.275+08:00 level=TRACE source=runner.go:174 msg="supported GPU library combinations before filtering" supported=map[CUDA:map[/usr/lib/ollama:map[GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8:0]]]
time=2025-12-11T16:44:17.275+08:00 level=TRACE source=runner.go:183 msg="removing unsupported or overlapping GPU combination" libDir=/usr/lib/ollama description="AMD Ryzen 7 9700X 8-Core Processor" compute=gfx1036 pci_id=0000:0c:00.0
time=2025-12-11T16:44:17.275+08:00 level=DEBUG source=runner.go:40 msg="GPU bootstrap discovery took" duration=1.087772337s
time=2025-12-11T16:44:17.275+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 filter_id="" library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5070 Ti" libdirs=ollama driver=13.0 pci_id=0000:01:00.0 type=discrete total="15.9 GiB" available="13.0 GiB"
time=2025-12-11T16:44:17.275+08:00 level=INFO source=routes.go:1638 msg="entering low vram mode" "total vram"="15.9 GiB" threshold="20.0 GiB"
[GIN] 2025/12/11 - 16:44:36 | 404 |      333.63µs |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/12/11 - 16:44:55 | 200 |      23.699µs |       127.0.0.1 | GET      "/"
[GIN] 2025/12/11 - 16:45:07 | 200 |       19.74µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/12/11 - 16:45:07 | 200 |       96.44µs |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/12/11 - 16:45:33 | 200 |       19.63µs |       127.0.0.1 | HEAD     "/"
time=2025-12-11T16:45:35.404+08:00 level=INFO source=download.go:177 msg="downloading aabd4debf0c8 in 12 100 MB part(s)"
time=2025-12-11T16:46:20.098+08:00 level=INFO source=download.go:177 msg="downloading c5ad996bda6e in 1 556 B part(s)"
time=2025-12-11T16:46:21.700+08:00 level=INFO source=download.go:177 msg="downloading 6e4c38e1172f in 1 1.1 KB part(s)"
time=2025-12-11T16:46:23.637+08:00 level=INFO source=download.go:177 msg="downloading f4d24e9138dd in 1 148 B part(s)"
time=2025-12-11T16:46:25.273+08:00 level=INFO source=download.go:177 msg="downloading a85fe2a2e58e in 1 487 B part(s)"
[GIN] 2025/12/11 - 16:46:27 | 200 | 53.223237688s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2025/12/11 - 16:46:37 | 200 |     209.369µs |       127.0.0.1 | GET      "/api/tags"
time=2025-12-11T16:46:51.274+08:00 level=TRACE source=sched.go:146 msg="processing incoming request" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc
time=2025-12-11T16:46:51.274+08:00 level=TRACE source=sched.go:179 msg="refreshing GPU list" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc
time=2025-12-11T16:46:51.274+08:00 level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2025-12-11T16:46:51.274+08:00 level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
time=2025-12-11T16:46:51.274+08:00 level=TRACE source=runner.go:440 msg="starting runner for device discovery" libDirs="[/usr/lib/ollama ]" extraEnvs=map[]
time=2025-12-11T16:46:51.274+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 44023"
time=2025-12-11T16:46:51.274+08:00 level=DEBUG source=server.go:393 msg=subprocess CUDA_PATH=/opt/cuda PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/lib/rustup/bin ROCM_PATH=/opt/rocm OLLAMA_DEBUG=2 LD_LIBRARY_PATH=/usr/lib/ollama: OLLAMA_LIBRARY_PATH=/usr/lib/ollama:
time=2025-12-11T16:46:51.280+08:00 level=INFO source=runner.go:1398 msg="starting ollama engine"
time=2025-12-11T16:46:51.280+08:00 level=INFO source=runner.go:1433 msg="Server listening on 127.0.0.1:44023"
time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=gguf.go:589 msg=general.architecture type=string
time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=gguf.go:589 msg=tokenizer.ggml.model type=string
time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.file_type default=0
time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.name default=""
time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.description default=""
time=2025-12-11T16:46:51.285+08:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3
time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Ryzen 7 9700X 8-Core Processor, gfx1036 (0x1036), VMM: no, Wave Size: 32, ID: 0
load_backend: loaded ROCm backend from /usr/lib/ollama/libggml-hip.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:88 msg="skipping path which is not part of ollama" path=/home/amekiri
time=2025-12-11T16:46:51.339+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,880,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.block_count default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.pooling_type default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.expert_count default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}"
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}"
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}"
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}"
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.pre default=""
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.block_count default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.embedding_length default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.head_count default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.head_count_kv default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.key_length default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.dimension_count default=0
time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0
time=2025-12-11T16:46:51.340+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.freq_base default=100000
time=2025-12-11T16:46:51.340+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.scaling.factor default=1
time=2025-12-11T16:46:51.340+08:00 level=DEBUG source=runner.go:1373 msg="dummy model load took" duration=54.37661ms
ggml_backend_cuda_device_get_memory device GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 utilizing NVML memory reporting free: 13972471808 total: 17094934528
ggml_hip_get_device_memory searching for device 0000:0c:00.0
ggml_backend_cuda_device_get_memory device 0000:0c:00.0 utilizing AMD specific memory reporting free: 2121228288 total: 2147483648
time=2025-12-11T16:46:51.349+08:00 level=DEBUG source=runner.go:1378 msg="gathering device infos took" duration=9.35368ms
time=2025-12-11T16:46:51.349+08:00 level=TRACE source=runner.go:467 msg="runner enumerated devices" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama ]" devices="[{DeviceID:{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA} Name:CUDA0 Description:NVIDIA GeForce RTX 5070 Ti FilterID: Integrated:false PCIID:0000:01:00.0 TotalMemory:17094934528 FreeMemory:13972471808 ComputeMajor:12 ComputeMinor:0 DriverMajor:13 DriverMinor:0 LibraryPath:[/usr/lib/ollama ]} {DeviceID:{ID:0 Library:ROCm} Name:ROCm0 Description:AMD Ryzen 7 9700X 8-Core Processor FilterID: Integrated:true PCIID:0000:0c:00.0 TotalMemory:2147483648 FreeMemory:2121228288 ComputeMajor:16 ComputeMinor:54 DriverMajor:70152 DriverMinor:80 LibraryPath:[/usr/lib/ollama ]}]"
time=2025-12-11T16:46:51.349+08:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=75.356547ms OLLAMA_LIBRARY_PATH="[/usr/lib/ollama ]" extra_envs=map[]
time=2025-12-11T16:46:51.349+08:00 level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=75.385557ms
time=2025-12-11T16:46:51.349+08:00 level=TRACE source=sched.go:182 msg="refreshing system information" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc
time=2025-12-11T16:46:51.349+08:00 level=TRACE source=gpu.go:22 msg="performing CPU discovery"
time=2025-12-11T16:46:51.350+08:00 level=TRACE source=gpu.go:25 msg="CPU discovery completed" duration=343.3µs
time=2025-12-11T16:46:51.350+08:00 level=DEBUG source=sched.go:194 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-12-11T16:46:51.350+08:00 level=TRACE source=sched.go:198 msg="loading model metadata" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc
time=2025-12-11T16:46:51.356+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32
time=2025-12-11T16:46:51.356+08:00 level=TRACE source=sched.go:206 msg="updating free space" gpu_count=1 model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc
time=2025-12-11T16:46:51.356+08:00 level=DEBUG source=sched.go:211 msg="loading first model" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.04 GiB (5.00 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151643 ('<|end▁of▁sentence|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 1.78 B
print_info: general.name     = DeepSeek R1 Distill Qwen 1.5B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-12-11T16:46:51.473+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --model /home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc --port 40115"
time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=server.go:393 msg=subprocess CUDA_PATH=/opt/cuda PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/lib/rustup/bin ROCM_PATH=/opt/rocm OLLAMA_DEBUG=2 LD_LIBRARY_PATH=/usr/lib/ollama OLLAMA_LIBRARY_PATH=/usr/lib/ollama
time=2025-12-11T16:46:51.473+08:00 level=INFO source=sched.go:443 msg="system memory" total="60.5 GiB" free="43.3 GiB" free_swap="48.0 GiB"
time=2025-12-11T16:46:51.473+08:00 level=INFO source=sched.go:450 msg="gpu memory" id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 library=CUDA available="12.6 GiB" free="13.0 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-12-11T16:46:51.473+08:00 level=INFO source=server.go:459 msg="loading model" "model layers"=29 requested=-1
time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=qwen2.attention.key_length default=128
time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=qwen2.attention.value_length default=128
time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=ggml.go:614 msg="default cache size estimate" "attention MiB"=112 "attention bytes"=117440512 "recurrent MiB"=0 "recurrent bytes"=0
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=0 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=1 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=2 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=3 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=4 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=5 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=6 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=7 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=8 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=9 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=10 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=11 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=12 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=13 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=14 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=15 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=16 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=17 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=18 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=19 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=20 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=21 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=22 size="29.1 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=23 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=24 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=25 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=26 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=27 size="32.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=28 size="182.6 MiB"
time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=server.go:921 msg="available gpu" id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 library=CUDA "available layer vram"="12.5 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=0 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=1 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=2 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=3 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=4 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=5 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=6 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=7 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=8 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=9 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=10 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=11 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=12 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=13 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=14 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=15 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=16 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=17 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=18 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=19 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=20 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=21 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=22 size="29.1 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=23 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=24 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=25 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=26 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=27 size="32.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=28 size="182.6 MiB"
time=2025-12-11T16:46:51.474+08:00 level=DEBUG source=server.go:921 msg="available gpu" id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 library=CUDA "available layer vram"="12.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="482.3 MiB"
time=2025-12-11T16:46:51.474+08:00 level=DEBUG source=server.go:614 msg=memory estimate.CUDA0.ID=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 estimate.CUDA0.Weights="[29990912 29990912 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 29990912 29990912 29990912 29990912 191445504]" estimate.CUDA0.Cache="[4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 0]" estimate.CUDA0.Graph=314310656
time=2025-12-11T16:46:51.474+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="934.7 MiB"
time=2025-12-11T16:46:51.474+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="112.0 MiB"
time=2025-12-11T16:46:51.474+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="299.8 MiB"
time=2025-12-11T16:46:51.474+08:00 level=INFO source=device.go:272 msg="total memory" size="1.3 GiB"
time=2025-12-11T16:46:51.478+08:00 level=INFO source=runner.go:963 msg="starting go runner"
time=2025-12-11T16:46:51.478+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Ryzen 7 9700X 8-Core Processor, gfx1036 (0x1036), VMM: no, Wave Size: 32, ID: 0
load_backend: loaded ROCm backend from /usr/lib/ollama/libggml-hip.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-12-11T16:46:51.530+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,880,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-12-11T16:46:51.530+08:00 level=INFO source=runner.go:999 msg="Server listening on 127.0.0.1:40115"
time=2025-12-11T16:46:51.538+08:00 level=INFO source=runner.go:893 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:8 GPULayers:29[ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
time=2025-12-11T16:46:51.538+08:00 level=INFO source=server.go:1294 msg="waiting for llama runner to start responding"
time=2025-12-11T16:46:51.538+08:00 level=INFO source=server.go:1328 msg="waiting for server to become available" status="llm server loading model"
ggml_backend_cuda_device_get_memory device GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 utilizing NVML memory reporting free: 13978238976 total: 17094934528
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5070 Ti) (0000:01:00.0) - 13330 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.04 GiB (5.00 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151643 ('<|end▁of▁sentence|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 1536
print_info: n_embd_inp       = 1536
print_info: n_layer          = 28
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8960
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 1.5B
print_info: model params     = 1.78 B
print_info: general.name     = DeepSeek R1 Distill Qwen 1.5B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_q.bias
create_tensor: loading tensor blk.0.attn_k.bias
create_tensor: loading tensor blk.0.attn_v.bias
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_q.bias
create_tensor: loading tensor blk.1.attn_k.bias
create_tensor: loading tensor blk.1.attn_v.bias
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_q.bias
create_tensor: loading tensor blk.2.attn_k.bias
create_tensor: loading tensor blk.2.attn_v.bias
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_q.bias
create_tensor: loading tensor blk.3.attn_k.bias
create_tensor: loading tensor blk.3.attn_v.bias
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_q.bias
create_tensor: loading tensor blk.4.attn_k.bias
create_tensor: loading tensor blk.4.attn_v.bias
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_q.bias
create_tensor: loading tensor blk.5.attn_k.bias
create_tensor: loading tensor blk.5.attn_v.bias
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_q.bias
create_tensor: loading tensor blk.6.attn_k.bias
create_tensor: loading tensor blk.6.attn_v.bias
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_q.bias
create_tensor: loading tensor blk.7.attn_k.bias
create_tensor: loading tensor blk.7.attn_v.bias
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_q.bias
create_tensor: loading tensor blk.8.attn_k.bias
create_tensor: loading tensor blk.8.attn_v.bias
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_q.bias
create_tensor: loading tensor blk.9.attn_k.bias
create_tensor: loading tensor blk.9.attn_v.bias
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_q.bias
create_tensor: loading tensor blk.10.attn_k.bias
create_tensor: loading tensor blk.10.attn_v.bias
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_q.bias
create_tensor: loading tensor blk.11.attn_k.bias
create_tensor: loading tensor blk.11.attn_v.bias
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_q.bias
create_tensor: loading tensor blk.12.attn_k.bias
create_tensor: loading tensor blk.12.attn_v.bias
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_q.bias
create_tensor: loading tensor blk.13.attn_k.bias
create_tensor: loading tensor blk.13.attn_v.bias
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_q.bias
create_tensor: loading tensor blk.14.attn_k.bias
create_tensor: loading tensor blk.14.attn_v.bias
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_q.bias
create_tensor: loading tensor blk.15.attn_k.bias
create_tensor: loading tensor blk.15.attn_v.bias
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_q.bias
create_tensor: loading tensor blk.16.attn_k.bias
create_tensor: loading tensor blk.16.attn_v.bias
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_q.bias
create_tensor: loading tensor blk.17.attn_k.bias
create_tensor: loading tensor blk.17.attn_v.bias
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_q.bias
create_tensor: loading tensor blk.18.attn_k.bias
create_tensor: loading tensor blk.18.attn_v.bias
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_q.bias
create_tensor: loading tensor blk.19.attn_k.bias
create_tensor: loading tensor blk.19.attn_v.bias
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_q.bias
create_tensor: loading tensor blk.20.attn_k.bias
create_tensor: loading tensor blk.20.attn_v.bias
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_q.bias
create_tensor: loading tensor blk.21.attn_k.bias
create_tensor: loading tensor blk.21.attn_v.bias
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_q.bias
create_tensor: loading tensor blk.22.attn_k.bias
create_tensor: loading tensor blk.22.attn_v.bias
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_q.bias
create_tensor: loading tensor blk.23.attn_k.bias
create_tensor: loading tensor blk.23.attn_v.bias
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_q.bias
create_tensor: loading tensor blk.24.attn_k.bias
create_tensor: loading tensor blk.24.attn_v.bias
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_q.bias
create_tensor: loading tensor blk.25.attn_k.bias
create_tensor: loading tensor blk.25.attn_v.bias
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_q.bias
create_tensor: loading tensor blk.26.attn_k.bias
create_tensor: loading tensor blk.26.attn_v.bias
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_q.bias
create_tensor: loading tensor blk.27.attn_k.bias
create_tensor: loading tensor blk.27.attn_v.bias
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.ffn_up.weight
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   125.19 MiB
load_tensors:        CUDA0 model buffer size =   934.70 MiB
time=2025-12-11T16:46:51.789+08:00 level=DEBUG source=server.go:1338 msg="model load progress 0.58"
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.59 MiB
llama_kv_cache: layer   0: dev = CUDA0
llama_kv_cache: layer   1: dev = CUDA0
llama_kv_cache: layer   2: dev = CUDA0
llama_kv_cache: layer   3: dev = CUDA0
llama_kv_cache: layer   4: dev = CUDA0
llama_kv_cache: layer   5: dev = CUDA0
llama_kv_cache: layer   6: dev = CUDA0
llama_kv_cache: layer   7: dev = CUDA0
llama_kv_cache: layer   8: dev = CUDA0
llama_kv_cache: layer   9: dev = CUDA0
llama_kv_cache: layer  10: dev = CUDA0
llama_kv_cache: layer  11: dev = CUDA0
llama_kv_cache: layer  12: dev = CUDA0
llama_kv_cache: layer  13: dev = CUDA0
llama_kv_cache: layer  14: dev = CUDA0
llama_kv_cache: layer  15: dev = CUDA0
llama_kv_cache: layer  16: dev = CUDA0
llama_kv_cache: layer  17: dev = CUDA0
llama_kv_cache: layer  18: dev = CUDA0
llama_kv_cache: layer  19: dev = CUDA0
llama_kv_cache: layer  20: dev = CUDA0
llama_kv_cache: layer  21: dev = CUDA0
llama_kv_cache: layer  22: dev = CUDA0
llama_kv_cache: layer  23: dev = CUDA0
llama_kv_cache: layer  24: dev = CUDA0
llama_kv_cache: layer  25: dev = CUDA0
llama_kv_cache: layer  26: dev = CUDA0
llama_kv_cache: layer  27: dev = CUDA0
llama_kv_cache:      CUDA0 KV buffer size =   112.00 MiB
llama_kv_cache: size =  112.00 MiB (  4096 cells,  28 layers,  1/1 seqs), K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2712
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
time=2025-12-11T16:46:52.040+08:00 level=DEBUG source=server.go:1338 msg="model load progress 1.00"
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      CUDA0 compute buffer size =   299.75 MiB
llama_context:  CUDA_Host compute buffer size =    12.01 MiB
llama_context: graph nodes  = 1098
llama_context: graph splits = 2
time=2025-12-11T16:46:52.290+08:00 level=INFO source=server.go:1332 msg="llama runner started in 0.82 seconds"
time=2025-12-11T16:46:52.290+08:00 level=INFO source=sched.go:517 msg="loaded runners" count=1
time=2025-12-11T16:46:52.290+08:00 level=INFO source=server.go:1294 msg="waiting for llama runner to start responding"
time=2025-12-11T16:46:52.290+08:00 level=INFO source=server.go:1332 msg="llama runner started in 0.82 seconds"
time=2025-12-11T16:46:52.290+08:00 level=DEBUG source=sched.go:529 msg="finished setting up" runner.name=registry.ollama.ai/library/deepseek-r1:1.5b runner.inference="[{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA}]" runner.size="1.3 GiB" runner.vram="1.3 GiB" runner.parallel=1 runner.pid=1241307 runner.model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc runner.num_ctx=4096
time=2025-12-11T16:46:52.291+08:00 level=DEBUG source=server.go:1465 msg="completion request" images=0 prompt=104 format=""
time=2025-12-11T16:46:52.291+08:00 level=TRACE source=server.go:1466 msg="completion request" prompt="<|User|>你好,请问你是谁?\n\n你好,请问你是谁?<|Assistant|><think>\n\n</think>\n\n"
time=2025-12-11T16:46:52.293+08:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=19 used=0 remaining=19
[GIN] 2025/12/11 - 16:46:52 | 200 |  1.172291934s |       127.0.0.1 | POST     "/api/chat"
time=2025-12-11T16:46:52.403+08:00 level=DEBUG source=sched.go:537 msg="context for request finished"
time=2025-12-11T16:46:52.403+08:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/deepseek-r1:1.5b runner.inference="[{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA}]" runner.size="1.3 GiB" runner.vram="1.3 GiB" runner.parallel=1 runner.pid=1241307 runner.model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc runner.num_ctx=4096 duration=5m0s
time=2025-12-11T16:46:52.403+08:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/deepseek-r1:1.5b runner.inference="[{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA}]" runner.size="1.3 GiB" runner.vram="1.3 GiB" runner.parallel=1 runner.pid=1241307 runner.model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc runner.num_ctx=4096 refCount=0
^Ctime=2025-12-11T16:47:30.906+08:00 level=DEBUG source=sched.go:844 msg="shutting down runner" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc
time=2025-12-11T16:47:30.906+08:00 level=DEBUG source=sched.go:136 msg="shutting down scheduler pending loop"
time=2025-12-11T16:47:30.906+08:00 level=DEBUG source=sched.go:269 msg="shutting down scheduler completed loop"
time=2025-12-11T16:47:30.922+08:00 level=DEBUG source=server.go:1755 msg="stopping llama server" pid=1241307
time=2025-12-11T16:47:30.922+08:00 level=DEBUG source=server.go:1761 msg="waiting for llama server to exit" pid=1241307
time=2025-12-11T16:47:31.165+08:00 level=DEBUG source=server.go:1765 msg="llama server stopped" pid=1241307

I found that Ollama does use the GPU when I launch it with `ollama serve` instead of through systemd. When I start it with `systemctl start ollama`, however, I see some errors:

➜  ~ systemctl status ollama
● ollama.service - Ollama Service
     Loaded: loaded (/usr/lib/systemd/system/ollama.service; enabled; preset: disabled)
     Active: active (running) since Thu 2025-12-11 16:48:42 CST; 4s ago
 Invocation: 9917834945de4488b7a7143f1a23bb43
   Main PID: 1252847 (ollama)
      Tasks: 13 (limit: 73914)
     Memory: 14.2M (peak: 286.6M)
        CPU: 334ms
     CGroup: /system.slice/ollama.service
             └─1252847 /usr/bin/ollama serve

Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.589+08:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0"
Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.589+08:00 level=INFO source=routes.go:1597 msg="Listening on 127.0.0.1:11434 (version 0.13.2)"
Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.589+08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.590+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 44287"
Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.670+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 37077"
Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.670+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 35601"
Dec 11 16:48:43 amekiri systemd-coredump[1253010]: [🡕] Process 1252958 (ollama) of user 963 dumped core.
                                                   
                                                   Stack trace of thread 1252971:
                                                   #0  0x00007f7a39a9890c n/a (libc.so.6 + 0x9890c)
                                                   #1  0x00007f7a39a3e3a0 raise (libc.so.6 + 0x3e3a0)
                                                   #2  0x00007f7a39a2557a abort (libc.so.6 + 0x2557a)
                                                   #3  0x00007f770234d6f5 rocblas_abort (librocblas.so.5 + 0x954d6f5)
                                                   #4  0x00007f770219711e _ZN12_GLOBAL__N_123get_library_and_adapterEPSt10shared_ptrIN7Tensile21MasterSolutionLibraryINS1_18ContractionProblemENS1_19ContractionSolutionEEEEPS0_I20hipDeviceProp_tR0600Ei (librocblas.so.5 + 0x939711e)
                                                   #5  0x00007f773fa8b532 n/a (libggml-hip.so + 0x3ce8b532)
                                                   #6  0x00007f773fa8c1b3 ggml_backend_cuda_reg (libggml-hip.so + 0x3ce8c1b3)
                                                   #7  0x00005574019d16fb n/a (/usr/bin/ollama + 0xe1b6fb)
                                                   #8  0x00005574019cf642 n/a (/usr/bin/ollama + 0xe19642)
                                                   #9  0x00005574019d0a7c n/a (/usr/bin/ollama + 0xe1aa7c)
                                                   #10 0x0000557400c7ae21 n/a (/usr/bin/ollama + 0xc4e21)
                                                   ELF object binary architecture: AMD x86-64
Dec 11 16:48:43 amekiri ollama[1252847]: time=2025-12-11T16:48:43.581+08:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" error="runner crashed"
Dec 11 16:48:43 amekiri ollama[1252847]: time=2025-12-11T16:48:43.581+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 filter_id="" library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5070 Ti" libdirs=ollama driver=13.0 pci_id=0000:01:00.0 type=discrete total="15.9 GiB" available="13.1 GiB"
Dec 11 16:48:43 amekiri ollama[1252847]: time=2025-12-11T16:48:43.581+08:00 level=INFO source=routes.go:1638 msg="entering low vram mode" "total vram"="15.9 GiB" threshold="20.0 GiB"
<!-- gh-comment-id:3640904093 --> @amekiri13 commented on GitHub (Dec 11, 2025): Here are logs of Ollama when `OLLAMA_DEBUG=2` setted: ``` time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=106.556373ms OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] extra_envs=map[] time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=runner.go:124 msg="evaluating which, if any, devices to filter out" initial_count=2 time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=/usr/lib/ollama description="NVIDIA GeForce RTX 5070 Ti" compute=12.0 id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 pci_id=0000:01:00.0 time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=runner.go:146 msg="verifying if device is supported" library=/usr/lib/ollama description="AMD Ryzen 7 9700X 8-Core Processor" compute=gfx1036 id=0 pci_id=0000:0c:00.0 time=2025-12-11T16:44:16.294+08:00 level=TRACE source=runner.go:440 msg="starting runner for device discovery" libDirs=[/usr/lib/ollama] extraEnvs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" time=2025-12-11T16:44:16.294+08:00 level=TRACE source=runner.go:440 msg="starting runner for device discovery" libDirs=[/usr/lib/ollama] extraEnvs="map[CUDA_VISIBLE_DEVICES:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 GGML_CUDA_INIT:1]" time=2025-12-11T16:44:16.294+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 39761" time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=server.go:393 msg=subprocess CUDA_PATH=/opt/cuda PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/lib/rustup/bin ROCM_PATH=/opt/rocm OLLAMA_DEBUG=2 LD_LIBRARY_PATH=/usr/lib/ollama OLLAMA_LIBRARY_PATH=/usr/lib/ollama CUDA_VISIBLE_DEVICES=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 GGML_CUDA_INIT=1 time=2025-12-11T16:44:16.294+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 45061" time=2025-12-11T16:44:16.294+08:00 level=DEBUG source=server.go:393 msg=subprocess CUDA_PATH=/opt/cuda PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/lib/rustup/bin ROCM_PATH=/opt/rocm OLLAMA_DEBUG=2 LD_LIBRARY_PATH=/usr/lib/ollama OLLAMA_LIBRARY_PATH=/usr/lib/ollama ROCR_VISIBLE_DEVICES=0 GGML_CUDA_INIT=1 time=2025-12-11T16:44:16.300+08:00 level=INFO source=runner.go:1398 msg="starting ollama engine" time=2025-12-11T16:44:16.300+08:00 level=INFO source=runner.go:1433 msg="Server listening on 127.0.0.1:45061" time=2025-12-11T16:44:16.300+08:00 level=INFO source=runner.go:1398 msg="starting ollama engine" time=2025-12-11T16:44:16.300+08:00 level=INFO source=runner.go:1433 msg="Server listening on 127.0.0.1:39761" time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=gguf.go:589 msg=general.architecture type=string time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=gguf.go:589 msg=tokenizer.ggml.model type=string time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=gguf.go:589 msg=general.architecture type=string time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=gguf.go:589 msg=tokenizer.ggml.model type=string time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32 time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment 
default=32 time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32 time=2025-12-11T16:44:16.305+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32 time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.file_type default=0 time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.file_type default=0 time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.name default="" time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.description default="" time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.name default="" time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.description default="" time=2025-12-11T16:44:16.306+08:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3 time=2025-12-11T16:44:16.306+08:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3 time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama time=2025-12-11T16:44:16.306+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected load_backend: loaded ROCm backend from /usr/lib/ollama/libggml-hip.so load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so time=2025-12-11T16:44:16.366+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,880,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.block_count default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.pooling_type default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.expert_count default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}" time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}" 
time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}" time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}" time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.pre default="" time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.block_count default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.embedding_length default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.head_count default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.head_count_kv default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.key_length default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.dimension_count default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.freq_base default=100000 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.scaling.factor default=1 time=2025-12-11T16:44:16.366+08:00 level=DEBUG source=runner.go:1373 msg="dummy model load took" duration=60.432184ms ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: ggml_cuda_init: initializing rocBLAS on device 0 rocBLAS error: Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1036 List of available TensileLibrary Files : "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1012.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1103.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1150.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1151.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1200.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1201.dat" 
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat" "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx950.dat" ggml_backend_cuda_device_get_memory device GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 utilizing NVML memory reporting free: 13942194176 total: 17094934528 time=2025-12-11T16:44:16.375+08:00 level=DEBUG source=runner.go:1378 msg="gathering device infos took" duration=9.476959ms time=2025-12-11T16:44:16.376+08:00 level=TRACE source=runner.go:467 msg="runner enumerated devices" OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] devices="[{DeviceID:{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA} Name:CUDA0 Description:NVIDIA GeForce RTX 5070 Ti FilterID: Integrated:false PCIID:0000:01:00.0 TotalMemory:17094934528 FreeMemory:13942194176 ComputeMajor:12 ComputeMinor:0 DriverMajor:13 DriverMinor:0 LibraryPath:[/usr/lib/ollama]}]" time=2025-12-11T16:44:16.376+08:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=81.9498ms OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] extra_envs="map[CUDA_VISIBLE_DEVICES:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 GGML_CUDA_INIT:1]" time=2025-12-11T16:44:17.275+08:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" error="runner crashed" time=2025-12-11T16:44:17.275+08:00 level=TRACE source=runner.go:467 msg="runner enumerated devices" OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] devices=[] time=2025-12-11T16:44:17.275+08:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=981.006794ms OLLAMA_LIBRARY_PATH=[/usr/lib/ollama] extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" time=2025-12-11T16:44:17.275+08:00 level=DEBUG source=runner.go:153 msg="filtering device which didn't fully initialize" id=0 libdir=/usr/lib/ollama pci_id=0000:0c:00.0 library=ROCm time=2025-12-11T16:44:17.275+08:00 level=TRACE source=runner.go:174 msg="supported GPU library combinations before filtering" supported=map[CUDA:map[/usr/lib/ollama:map[GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8:0]]] time=2025-12-11T16:44:17.275+08:00 level=TRACE source=runner.go:183 msg="removing unsupported or overlapping GPU combination" libDir=/usr/lib/ollama description="AMD Ryzen 7 9700X 8-Core Processor" compute=gfx1036 pci_id=0000:0c:00.0 time=2025-12-11T16:44:17.275+08:00 level=DEBUG source=runner.go:40 msg="GPU bootstrap discovery took" duration=1.087772337s time=2025-12-11T16:44:17.275+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 filter_id="" library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5070 Ti" libdirs=ollama driver=13.0 pci_id=0000:01:00.0 type=discrete total="15.9 GiB" available="13.0 GiB" time=2025-12-11T16:44:17.275+08:00 level=INFO source=routes.go:1638 msg="entering low vram mode" "total vram"="15.9 GiB" threshold="20.0 GiB" [GIN] 2025/12/11 - 16:44:36 | 404 | 333.63µs | 127.0.0.1 | POST "/api/chat" [GIN] 2025/12/11 - 16:44:55 | 200 | 23.699µs | 127.0.0.1 | GET "/" [GIN] 2025/12/11 - 16:45:07 | 200 | 19.74µs | 127.0.0.1 | HEAD "/" [GIN] 2025/12/11 - 16:45:07 | 200 | 96.44µs | 127.0.0.1 | GET "/api/tags" [GIN] 2025/12/11 - 16:45:33 | 200 | 19.63µs | 127.0.0.1 | HEAD "/" time=2025-12-11T16:45:35.404+08:00 
level=INFO source=download.go:177 msg="downloading aabd4debf0c8 in 12 100 MB part(s)" time=2025-12-11T16:46:20.098+08:00 level=INFO source=download.go:177 msg="downloading c5ad996bda6e in 1 556 B part(s)" time=2025-12-11T16:46:21.700+08:00 level=INFO source=download.go:177 msg="downloading 6e4c38e1172f in 1 1.1 KB part(s)" time=2025-12-11T16:46:23.637+08:00 level=INFO source=download.go:177 msg="downloading f4d24e9138dd in 1 148 B part(s)" time=2025-12-11T16:46:25.273+08:00 level=INFO source=download.go:177 msg="downloading a85fe2a2e58e in 1 487 B part(s)" [GIN] 2025/12/11 - 16:46:27 | 200 | 53.223237688s | 127.0.0.1 | POST "/api/pull" [GIN] 2025/12/11 - 16:46:37 | 200 | 209.369µs | 127.0.0.1 | GET "/api/tags" time=2025-12-11T16:46:51.274+08:00 level=TRACE source=sched.go:146 msg="processing incoming request" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc time=2025-12-11T16:46:51.274+08:00 level=TRACE source=sched.go:179 msg="refreshing GPU list" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc time=2025-12-11T16:46:51.274+08:00 level=DEBUG source=runner.go:264 msg="refreshing free memory" time=2025-12-11T16:46:51.274+08:00 level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery" time=2025-12-11T16:46:51.274+08:00 level=TRACE source=runner.go:440 msg="starting runner for device discovery" libDirs="[/usr/lib/ollama ]" extraEnvs=map[] time=2025-12-11T16:46:51.274+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 44023" time=2025-12-11T16:46:51.274+08:00 level=DEBUG source=server.go:393 msg=subprocess CUDA_PATH=/opt/cuda PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/lib/rustup/bin ROCM_PATH=/opt/rocm OLLAMA_DEBUG=2 LD_LIBRARY_PATH=/usr/lib/ollama: OLLAMA_LIBRARY_PATH=/usr/lib/ollama: time=2025-12-11T16:46:51.280+08:00 level=INFO source=runner.go:1398 msg="starting ollama engine" time=2025-12-11T16:46:51.280+08:00 level=INFO source=runner.go:1433 msg="Server listening on 127.0.0.1:44023" time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=gguf.go:589 msg=general.architecture type=string time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=gguf.go:589 msg=tokenizer.ggml.model type=string time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32 time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32 time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.file_type default=0 time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.name default="" time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.description default="" time=2025-12-11T16:46:51.285+08:00 level=INFO source=ggml.go:136 msg="" architecture=llama file_type=unknown name="" description="" num_tensors=0 num_key_values=3 time=2025-12-11T16:46:51.285+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5070 Ti, 
compute capability 12.0, VMM: yes, ID: GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Ryzen 7 9700X 8-Core Processor, gfx1036 (0x1036), VMM: no, Wave Size: 32, ID: 0 load_backend: loaded ROCm backend from /usr/lib/ollama/libggml-hip.so load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:88 msg="skipping path which is not part of ollama" path=/home/amekiri time=2025-12-11T16:46:51.339+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,880,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.block_count default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.pooling_type default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.expert_count default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.tokens default="&{size:0 values:[]}" time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.scores default="&{size:0 values:[]}" time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.token_type default="&{size:0 values:[]}" time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.merges default="&{size:0 values:[]}" time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.add_eos_token default=false time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.eos_token_id default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=tokenizer.ggml.pre default="" time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.block_count default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.embedding_length default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.head_count default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.head_count_kv default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type 
not found" key=llama.attention.key_length default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.dimension_count default=0 time=2025-12-11T16:46:51.339+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.attention.layer_norm_rms_epsilon default=0 time=2025-12-11T16:46:51.340+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.freq_base default=100000 time=2025-12-11T16:46:51.340+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=llama.rope.scaling.factor default=1 time=2025-12-11T16:46:51.340+08:00 level=DEBUG source=runner.go:1373 msg="dummy model load took" duration=54.37661ms ggml_backend_cuda_device_get_memory device GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 utilizing NVML memory reporting free: 13972471808 total: 17094934528 ggml_hip_get_device_memory searching for device 0000:0c:00.0 ggml_backend_cuda_device_get_memory device 0000:0c:00.0 utilizing AMD specific memory reporting free: 2121228288 total: 2147483648 time=2025-12-11T16:46:51.349+08:00 level=DEBUG source=runner.go:1378 msg="gathering device infos took" duration=9.35368ms time=2025-12-11T16:46:51.349+08:00 level=TRACE source=runner.go:467 msg="runner enumerated devices" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama ]" devices="[{DeviceID:{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA} Name:CUDA0 Description:NVIDIA GeForce RTX 5070 Ti FilterID: Integrated:false PCIID:0000:01:00.0 TotalMemory:17094934528 FreeMemory:13972471808 ComputeMajor:12 ComputeMinor:0 DriverMajor:13 DriverMinor:0 LibraryPath:[/usr/lib/ollama ]} {DeviceID:{ID:0 Library:ROCm} Name:ROCm0 Description:AMD Ryzen 7 9700X 8-Core Processor FilterID: Integrated:true PCIID:0000:0c:00.0 TotalMemory:2147483648 FreeMemory:2121228288 ComputeMajor:16 ComputeMinor:54 DriverMajor:70152 DriverMinor:80 LibraryPath:[/usr/lib/ollama ]}]" time=2025-12-11T16:46:51.349+08:00 level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=75.356547ms OLLAMA_LIBRARY_PATH="[/usr/lib/ollama ]" extra_envs=map[] time=2025-12-11T16:46:51.349+08:00 level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=75.385557ms time=2025-12-11T16:46:51.349+08:00 level=TRACE source=sched.go:182 msg="refreshing system information" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc time=2025-12-11T16:46:51.349+08:00 level=TRACE source=gpu.go:22 msg="performing CPU discovery" time=2025-12-11T16:46:51.350+08:00 level=TRACE source=gpu.go:25 msg="CPU discovery completed" duration=343.3µs time=2025-12-11T16:46:51.350+08:00 level=DEBUG source=sched.go:194 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1 time=2025-12-11T16:46:51.350+08:00 level=TRACE source=sched.go:198 msg="loading model metadata" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc time=2025-12-11T16:46:51.356+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=general.alignment default=32 time=2025-12-11T16:46:51.356+08:00 level=TRACE source=sched.go:206 msg="updating free space" gpu_count=1 model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc time=2025-12-11T16:46:51.356+08:00 level=DEBUG source=sched.go:211 msg="loading first model" 
model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 1.5B llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen llama_model_loader: - kv 4: general.size_label str = 1.5B llama_model_loader: - kv 5: qwen2.block_count u32 = 28 llama_model_loader: - kv 6: qwen2.context_length u32 = 131072 llama_model_loader: - kv 7: qwen2.embedding_length u32 = 1536 llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 8960 llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 12 llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 2 llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 13: general.file_type u32 = 15 llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 15: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646 llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643 llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de... 
llama_model_loader: - kv 25: general.quantization_version u32 = 2 llama_model_loader: - type f32: 141 tensors llama_model_loader: - type q4_K: 169 tensors llama_model_loader: - type q6_K: 29 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 1.04 GiB (5.00 BPW) init_tokenizer: initializing tokenizer for type 2 load: control token: 151660 '<|fim_middle|>' is not marked as EOG load: control token: 151659 '<|fim_prefix|>' is not marked as EOG load: control token: 151653 '<|vision_end|>' is not marked as EOG load: control token: 151645 '<|Assistant|>' is not marked as EOG load: control token: 151644 '<|User|>' is not marked as EOG load: control token: 151655 '<|image_pad|>' is not marked as EOG load: control token: 151651 '<|quad_end|>' is not marked as EOG load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG load: control token: 151652 '<|vision_start|>' is not marked as EOG load: control token: 151647 '<|EOT|>' is not marked as EOG load: control token: 151654 '<|vision_pad|>' is not marked as EOG load: control token: 151656 '<|video_pad|>' is not marked as EOG load: control token: 151661 '<|fim_suffix|>' is not marked as EOG load: control token: 151650 '<|quad_start|>' is not marked as EOG load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect load: printing all EOG tokens: load: - 151643 ('<|end▁of▁sentence|>') load: - 151662 ('<|fim_pad|>') load: - 151663 ('<|repo_name|>') load: - 151664 ('<|file_sep|>') load: special tokens cache size = 22 load: token to piece cache size = 0.9310 MB print_info: arch = qwen2 print_info: vocab_only = 1 print_info: model type = ?B print_info: model params = 1.78 B print_info: general.name = DeepSeek R1 Distill Qwen 1.5B print_info: vocab type = BPE print_info: n_vocab = 151936 print_info: n_merges = 151387 print_info: BOS token = 151646 '<|begin▁of▁sentence|>' print_info: EOS token = 151643 '<|end▁of▁sentence|>' print_info: EOT token = 151643 '<|end▁of▁sentence|>' print_info: PAD token = 151643 '<|end▁of▁sentence|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|end▁of▁sentence|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 llama_model_load: vocab only - skipping tensors time=2025-12-11T16:46:51.473+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama runner --model /home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc --port 40115" time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=server.go:393 msg=subprocess CUDA_PATH=/opt/cuda PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/opt/rocm/bin:/usr/lib/rustup/bin ROCM_PATH=/opt/rocm OLLAMA_DEBUG=2 LD_LIBRARY_PATH=/usr/lib/ollama OLLAMA_LIBRARY_PATH=/usr/lib/ollama time=2025-12-11T16:46:51.473+08:00 level=INFO source=sched.go:443 msg="system memory" total="60.5 GiB" free="43.3 GiB" free_swap="48.0 GiB" 
time=2025-12-11T16:46:51.473+08:00 level=INFO source=sched.go:450 msg="gpu memory" id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 library=CUDA available="12.6 GiB" free="13.0 GiB" minimum="457.0 MiB" overhead="0 B" time=2025-12-11T16:46:51.473+08:00 level=INFO source=server.go:459 msg="loading model" "model layers"=29 requested=-1 time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=qwen2.attention.key_length default=128 time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=ggml.go:279 msg="key with type not found" key=qwen2.attention.value_length default=128 time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=ggml.go:614 msg="default cache size estimate" "attention MiB"=112 "attention bytes"=117440512 "recurrent MiB"=0 "recurrent bytes"=0 time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=0 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=1 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=2 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=3 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=4 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=5 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=6 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=7 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=8 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=9 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=10 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=11 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=12 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=13 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=14 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=15 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=16 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=17 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=18 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=19 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=20 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=21 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=22 size="29.1 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=23 size="32.6 MiB" 
time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=24 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=25 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=26 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=27 size="32.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=28 size="182.6 MiB" time=2025-12-11T16:46:51.473+08:00 level=DEBUG source=server.go:921 msg="available gpu" id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 library=CUDA "available layer vram"="12.5 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=0 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=1 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=2 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=3 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=4 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=5 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=6 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=7 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=8 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=9 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=10 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=11 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=12 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=13 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=14 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=15 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=16 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=17 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=18 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=19 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=20 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=21 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=22 size="29.1 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=23 
size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=24 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=25 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=26 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=27 size="32.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=TRACE source=server.go:896 msg="layer to assign" layer=28 size="182.6 MiB" time=2025-12-11T16:46:51.474+08:00 level=DEBUG source=server.go:921 msg="available gpu" id=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 library=CUDA "available layer vram"="12.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="482.3 MiB" time=2025-12-11T16:46:51.474+08:00 level=DEBUG source=server.go:614 msg=memory estimate.CUDA0.ID=GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 estimate.CUDA0.Weights="[29990912 29990912 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 26341376 26341376 29990912 29990912 29990912 29990912 29990912 191445504]" estimate.CUDA0.Cache="[4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 4194304 0]" estimate.CUDA0.Graph=314310656 time=2025-12-11T16:46:51.474+08:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="934.7 MiB" time=2025-12-11T16:46:51.474+08:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="112.0 MiB" time=2025-12-11T16:46:51.474+08:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="299.8 MiB" time=2025-12-11T16:46:51.474+08:00 level=INFO source=device.go:272 msg="total memory" size="1.3 GiB" time=2025-12-11T16:46:51.478+08:00 level=INFO source=runner.go:963 msg="starting go runner" time=2025-12-11T16:46:51.478+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Ryzen 7 9700X 8-Core Processor, gfx1036 (0x1036), VMM: no, Wave Size: 32, ID: 0 load_backend: loaded ROCm backend from /usr/lib/ollama/libggml-hip.so load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so time=2025-12-11T16:46:51.530+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,880,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2025-12-11T16:46:51.530+08:00 level=INFO source=runner.go:999 msg="Server listening on 127.0.0.1:40115" time=2025-12-11T16:46:51.538+08:00 level=INFO source=runner.go:893 msg=load request="{Operation:commit LoraPath:[] Parallel:1 
BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:8 GPULayers:29[ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}" time=2025-12-11T16:46:51.538+08:00 level=INFO source=server.go:1294 msg="waiting for llama runner to start responding" time=2025-12-11T16:46:51.538+08:00 level=INFO source=server.go:1328 msg="waiting for server to become available" status="llm server loading model" ggml_backend_cuda_device_get_memory device GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 utilizing NVML memory reporting free: 13978238976 total: 17094934528 llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5070 Ti) (0000:01:00.0) - 13330 MiB free llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 1.5B llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen llama_model_loader: - kv 4: general.size_label str = 1.5B llama_model_loader: - kv 5: qwen2.block_count u32 = 28 llama_model_loader: - kv 6: qwen2.context_length u32 = 131072 llama_model_loader: - kv 7: qwen2.embedding_length u32 = 1536 llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 8960 llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 12 llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 2 llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 13: general.file_type u32 = 15 llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 15: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646 llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643 llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de... 
llama_model_loader: - kv 25: general.quantization_version u32 = 2 llama_model_loader: - type f32: 141 tensors llama_model_loader: - type q4_K: 169 tensors llama_model_loader: - type q6_K: 29 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 1.04 GiB (5.00 BPW) init_tokenizer: initializing tokenizer for type 2 load: control token: 151660 '<|fim_middle|>' is not marked as EOG load: control token: 151659 '<|fim_prefix|>' is not marked as EOG load: control token: 151653 '<|vision_end|>' is not marked as EOG load: control token: 151645 '<|Assistant|>' is not marked as EOG load: control token: 151644 '<|User|>' is not marked as EOG load: control token: 151655 '<|image_pad|>' is not marked as EOG load: control token: 151651 '<|quad_end|>' is not marked as EOG load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG load: control token: 151652 '<|vision_start|>' is not marked as EOG load: control token: 151647 '<|EOT|>' is not marked as EOG load: control token: 151654 '<|vision_pad|>' is not marked as EOG load: control token: 151656 '<|video_pad|>' is not marked as EOG load: control token: 151661 '<|fim_suffix|>' is not marked as EOG load: control token: 151650 '<|quad_start|>' is not marked as EOG load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect load: printing all EOG tokens: load: - 151643 ('<|end▁of▁sentence|>') load: - 151662 ('<|fim_pad|>') load: - 151663 ('<|repo_name|>') load: - 151664 ('<|file_sep|>') load: special tokens cache size = 22 load: token to piece cache size = 0.9310 MB print_info: arch = qwen2 print_info: vocab_only = 0 print_info: n_ctx_train = 131072 print_info: n_embd = 1536 print_info: n_embd_inp = 1536 print_info: n_layer = 28 print_info: n_head = 12 print_info: n_head_kv = 2 print_info: n_rot = 128 print_info: n_swa = 0 print_info: is_swa_any = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 6 print_info: n_embd_k_gqa = 256 print_info: n_embd_v_gqa = 256 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 8960 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: n_expert_groups = 0 print_info: n_group_used = 0 print_info: causal attn = 1 print_info: pooling type = -1 print_info: rope type = 2 print_info: rope scaling = linear print_info: freq_base_train = 10000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 131072 print_info: rope_finetuned = unknown print_info: model type = 1.5B print_info: model params = 1.78 B print_info: general.name = DeepSeek R1 Distill Qwen 1.5B print_info: vocab type = BPE print_info: n_vocab = 151936 print_info: n_merges = 151387 print_info: BOS token = 151646 '<|begin▁of▁sentence|>' print_info: EOS token = 151643 '<|end▁of▁sentence|>' print_info: EOT token = 151643 '<|end▁of▁sentence|>' print_info: PAD token = 151643 '<|end▁of▁sentence|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 
'<|end▁of▁sentence|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = true) load_tensors: layer 0 assigned to device CUDA0, is_swa = 0 load_tensors: layer 1 assigned to device CUDA0, is_swa = 0 load_tensors: layer 2 assigned to device CUDA0, is_swa = 0 load_tensors: layer 3 assigned to device CUDA0, is_swa = 0 load_tensors: layer 4 assigned to device CUDA0, is_swa = 0 load_tensors: layer 5 assigned to device CUDA0, is_swa = 0 load_tensors: layer 6 assigned to device CUDA0, is_swa = 0 load_tensors: layer 7 assigned to device CUDA0, is_swa = 0 load_tensors: layer 8 assigned to device CUDA0, is_swa = 0 load_tensors: layer 9 assigned to device CUDA0, is_swa = 0 load_tensors: layer 10 assigned to device CUDA0, is_swa = 0 load_tensors: layer 11 assigned to device CUDA0, is_swa = 0 load_tensors: layer 12 assigned to device CUDA0, is_swa = 0 load_tensors: layer 13 assigned to device CUDA0, is_swa = 0 load_tensors: layer 14 assigned to device CUDA0, is_swa = 0 load_tensors: layer 15 assigned to device CUDA0, is_swa = 0 load_tensors: layer 16 assigned to device CUDA0, is_swa = 0 load_tensors: layer 17 assigned to device CUDA0, is_swa = 0 load_tensors: layer 18 assigned to device CUDA0, is_swa = 0 load_tensors: layer 19 assigned to device CUDA0, is_swa = 0 load_tensors: layer 20 assigned to device CUDA0, is_swa = 0 load_tensors: layer 21 assigned to device CUDA0, is_swa = 0 load_tensors: layer 22 assigned to device CUDA0, is_swa = 0 load_tensors: layer 23 assigned to device CUDA0, is_swa = 0 load_tensors: layer 24 assigned to device CUDA0, is_swa = 0 load_tensors: layer 25 assigned to device CUDA0, is_swa = 0 load_tensors: layer 26 assigned to device CUDA0, is_swa = 0 load_tensors: layer 27 assigned to device CUDA0, is_swa = 0 load_tensors: layer 28 assigned to device CUDA0, is_swa = 0 create_tensor: loading tensor token_embd.weight create_tensor: loading tensor output_norm.weight create_tensor: loading tensor output.weight create_tensor: loading tensor blk.0.attn_norm.weight create_tensor: loading tensor blk.0.attn_q.weight create_tensor: loading tensor blk.0.attn_k.weight create_tensor: loading tensor blk.0.attn_v.weight create_tensor: loading tensor blk.0.attn_output.weight create_tensor: loading tensor blk.0.attn_q.bias create_tensor: loading tensor blk.0.attn_k.bias create_tensor: loading tensor blk.0.attn_v.bias create_tensor: loading tensor blk.0.ffn_norm.weight create_tensor: loading tensor blk.0.ffn_gate.weight create_tensor: loading tensor blk.0.ffn_down.weight create_tensor: loading tensor blk.0.ffn_up.weight create_tensor: loading tensor blk.1.attn_norm.weight create_tensor: loading tensor blk.1.attn_q.weight create_tensor: loading tensor blk.1.attn_k.weight create_tensor: loading tensor blk.1.attn_v.weight create_tensor: loading tensor blk.1.attn_output.weight create_tensor: loading tensor blk.1.attn_q.bias create_tensor: loading tensor blk.1.attn_k.bias create_tensor: loading tensor blk.1.attn_v.bias create_tensor: loading tensor blk.1.ffn_norm.weight create_tensor: loading tensor blk.1.ffn_gate.weight create_tensor: loading tensor blk.1.ffn_down.weight create_tensor: loading tensor blk.1.ffn_up.weight create_tensor: loading tensor blk.2.attn_norm.weight create_tensor: loading tensor blk.2.attn_q.weight create_tensor: loading tensor blk.2.attn_k.weight create_tensor: loading tensor 
blk.2.attn_v.weight create_tensor: loading tensor blk.2.attn_output.weight create_tensor: loading tensor blk.2.attn_q.bias create_tensor: loading tensor blk.2.attn_k.bias create_tensor: loading tensor blk.2.attn_v.bias create_tensor: loading tensor blk.2.ffn_norm.weight create_tensor: loading tensor blk.2.ffn_gate.weight create_tensor: loading tensor blk.2.ffn_down.weight create_tensor: loading tensor blk.2.ffn_up.weight create_tensor: loading tensor blk.3.attn_norm.weight create_tensor: loading tensor blk.3.attn_q.weight create_tensor: loading tensor blk.3.attn_k.weight create_tensor: loading tensor blk.3.attn_v.weight create_tensor: loading tensor blk.3.attn_output.weight create_tensor: loading tensor blk.3.attn_q.bias create_tensor: loading tensor blk.3.attn_k.bias create_tensor: loading tensor blk.3.attn_v.bias create_tensor: loading tensor blk.3.ffn_norm.weight create_tensor: loading tensor blk.3.ffn_gate.weight create_tensor: loading tensor blk.3.ffn_down.weight create_tensor: loading tensor blk.3.ffn_up.weight create_tensor: loading tensor blk.4.attn_norm.weight create_tensor: loading tensor blk.4.attn_q.weight create_tensor: loading tensor blk.4.attn_k.weight create_tensor: loading tensor blk.4.attn_v.weight create_tensor: loading tensor blk.4.attn_output.weight create_tensor: loading tensor blk.4.attn_q.bias create_tensor: loading tensor blk.4.attn_k.bias create_tensor: loading tensor blk.4.attn_v.bias create_tensor: loading tensor blk.4.ffn_norm.weight create_tensor: loading tensor blk.4.ffn_gate.weight create_tensor: loading tensor blk.4.ffn_down.weight create_tensor: loading tensor blk.4.ffn_up.weight create_tensor: loading tensor blk.5.attn_norm.weight create_tensor: loading tensor blk.5.attn_q.weight create_tensor: loading tensor blk.5.attn_k.weight create_tensor: loading tensor blk.5.attn_v.weight create_tensor: loading tensor blk.5.attn_output.weight create_tensor: loading tensor blk.5.attn_q.bias create_tensor: loading tensor blk.5.attn_k.bias create_tensor: loading tensor blk.5.attn_v.bias create_tensor: loading tensor blk.5.ffn_norm.weight create_tensor: loading tensor blk.5.ffn_gate.weight create_tensor: loading tensor blk.5.ffn_down.weight create_tensor: loading tensor blk.5.ffn_up.weight create_tensor: loading tensor blk.6.attn_norm.weight create_tensor: loading tensor blk.6.attn_q.weight create_tensor: loading tensor blk.6.attn_k.weight create_tensor: loading tensor blk.6.attn_v.weight create_tensor: loading tensor blk.6.attn_output.weight create_tensor: loading tensor blk.6.attn_q.bias create_tensor: loading tensor blk.6.attn_k.bias create_tensor: loading tensor blk.6.attn_v.bias create_tensor: loading tensor blk.6.ffn_norm.weight create_tensor: loading tensor blk.6.ffn_gate.weight create_tensor: loading tensor blk.6.ffn_down.weight create_tensor: loading tensor blk.6.ffn_up.weight create_tensor: loading tensor blk.7.attn_norm.weight create_tensor: loading tensor blk.7.attn_q.weight create_tensor: loading tensor blk.7.attn_k.weight create_tensor: loading tensor blk.7.attn_v.weight create_tensor: loading tensor blk.7.attn_output.weight create_tensor: loading tensor blk.7.attn_q.bias create_tensor: loading tensor blk.7.attn_k.bias create_tensor: loading tensor blk.7.attn_v.bias create_tensor: loading tensor blk.7.ffn_norm.weight create_tensor: loading tensor blk.7.ffn_gate.weight create_tensor: loading tensor blk.7.ffn_down.weight create_tensor: loading tensor blk.7.ffn_up.weight create_tensor: loading tensor blk.8.attn_norm.weight create_tensor: loading tensor 
blk.8.attn_q.weight create_tensor: loading tensor blk.8.attn_k.weight create_tensor: loading tensor blk.8.attn_v.weight create_tensor: loading tensor blk.8.attn_output.weight create_tensor: loading tensor blk.8.attn_q.bias create_tensor: loading tensor blk.8.attn_k.bias create_tensor: loading tensor blk.8.attn_v.bias create_tensor: loading tensor blk.8.ffn_norm.weight create_tensor: loading tensor blk.8.ffn_gate.weight create_tensor: loading tensor blk.8.ffn_down.weight create_tensor: loading tensor blk.8.ffn_up.weight create_tensor: loading tensor blk.9.attn_norm.weight create_tensor: loading tensor blk.9.attn_q.weight create_tensor: loading tensor blk.9.attn_k.weight create_tensor: loading tensor blk.9.attn_v.weight create_tensor: loading tensor blk.9.attn_output.weight create_tensor: loading tensor blk.9.attn_q.bias create_tensor: loading tensor blk.9.attn_k.bias create_tensor: loading tensor blk.9.attn_v.bias create_tensor: loading tensor blk.9.ffn_norm.weight create_tensor: loading tensor blk.9.ffn_gate.weight create_tensor: loading tensor blk.9.ffn_down.weight create_tensor: loading tensor blk.9.ffn_up.weight create_tensor: loading tensor blk.10.attn_norm.weight create_tensor: loading tensor blk.10.attn_q.weight create_tensor: loading tensor blk.10.attn_k.weight create_tensor: loading tensor blk.10.attn_v.weight create_tensor: loading tensor blk.10.attn_output.weight create_tensor: loading tensor blk.10.attn_q.bias create_tensor: loading tensor blk.10.attn_k.bias create_tensor: loading tensor blk.10.attn_v.bias create_tensor: loading tensor blk.10.ffn_norm.weight create_tensor: loading tensor blk.10.ffn_gate.weight create_tensor: loading tensor blk.10.ffn_down.weight create_tensor: loading tensor blk.10.ffn_up.weight create_tensor: loading tensor blk.11.attn_norm.weight create_tensor: loading tensor blk.11.attn_q.weight create_tensor: loading tensor blk.11.attn_k.weight create_tensor: loading tensor blk.11.attn_v.weight create_tensor: loading tensor blk.11.attn_output.weight create_tensor: loading tensor blk.11.attn_q.bias create_tensor: loading tensor blk.11.attn_k.bias create_tensor: loading tensor blk.11.attn_v.bias create_tensor: loading tensor blk.11.ffn_norm.weight create_tensor: loading tensor blk.11.ffn_gate.weight create_tensor: loading tensor blk.11.ffn_down.weight create_tensor: loading tensor blk.11.ffn_up.weight create_tensor: loading tensor blk.12.attn_norm.weight create_tensor: loading tensor blk.12.attn_q.weight create_tensor: loading tensor blk.12.attn_k.weight create_tensor: loading tensor blk.12.attn_v.weight create_tensor: loading tensor blk.12.attn_output.weight create_tensor: loading tensor blk.12.attn_q.bias create_tensor: loading tensor blk.12.attn_k.bias create_tensor: loading tensor blk.12.attn_v.bias create_tensor: loading tensor blk.12.ffn_norm.weight create_tensor: loading tensor blk.12.ffn_gate.weight create_tensor: loading tensor blk.12.ffn_down.weight create_tensor: loading tensor blk.12.ffn_up.weight create_tensor: loading tensor blk.13.attn_norm.weight create_tensor: loading tensor blk.13.attn_q.weight create_tensor: loading tensor blk.13.attn_k.weight create_tensor: loading tensor blk.13.attn_v.weight create_tensor: loading tensor blk.13.attn_output.weight create_tensor: loading tensor blk.13.attn_q.bias create_tensor: loading tensor blk.13.attn_k.bias create_tensor: loading tensor blk.13.attn_v.bias create_tensor: loading tensor blk.13.ffn_norm.weight create_tensor: loading tensor blk.13.ffn_gate.weight create_tensor: loading tensor 
blk.13.ffn_down.weight create_tensor: loading tensor blk.13.ffn_up.weight create_tensor: loading tensor blk.14.attn_norm.weight create_tensor: loading tensor blk.14.attn_q.weight create_tensor: loading tensor blk.14.attn_k.weight create_tensor: loading tensor blk.14.attn_v.weight create_tensor: loading tensor blk.14.attn_output.weight create_tensor: loading tensor blk.14.attn_q.bias create_tensor: loading tensor blk.14.attn_k.bias create_tensor: loading tensor blk.14.attn_v.bias create_tensor: loading tensor blk.14.ffn_norm.weight create_tensor: loading tensor blk.14.ffn_gate.weight create_tensor: loading tensor blk.14.ffn_down.weight create_tensor: loading tensor blk.14.ffn_up.weight create_tensor: loading tensor blk.15.attn_norm.weight create_tensor: loading tensor blk.15.attn_q.weight create_tensor: loading tensor blk.15.attn_k.weight create_tensor: loading tensor blk.15.attn_v.weight create_tensor: loading tensor blk.15.attn_output.weight create_tensor: loading tensor blk.15.attn_q.bias create_tensor: loading tensor blk.15.attn_k.bias create_tensor: loading tensor blk.15.attn_v.bias create_tensor: loading tensor blk.15.ffn_norm.weight create_tensor: loading tensor blk.15.ffn_gate.weight create_tensor: loading tensor blk.15.ffn_down.weight create_tensor: loading tensor blk.15.ffn_up.weight create_tensor: loading tensor blk.16.attn_norm.weight create_tensor: loading tensor blk.16.attn_q.weight create_tensor: loading tensor blk.16.attn_k.weight create_tensor: loading tensor blk.16.attn_v.weight create_tensor: loading tensor blk.16.attn_output.weight create_tensor: loading tensor blk.16.attn_q.bias create_tensor: loading tensor blk.16.attn_k.bias create_tensor: loading tensor blk.16.attn_v.bias create_tensor: loading tensor blk.16.ffn_norm.weight create_tensor: loading tensor blk.16.ffn_gate.weight create_tensor: loading tensor blk.16.ffn_down.weight create_tensor: loading tensor blk.16.ffn_up.weight create_tensor: loading tensor blk.17.attn_norm.weight create_tensor: loading tensor blk.17.attn_q.weight create_tensor: loading tensor blk.17.attn_k.weight create_tensor: loading tensor blk.17.attn_v.weight create_tensor: loading tensor blk.17.attn_output.weight create_tensor: loading tensor blk.17.attn_q.bias create_tensor: loading tensor blk.17.attn_k.bias create_tensor: loading tensor blk.17.attn_v.bias create_tensor: loading tensor blk.17.ffn_norm.weight create_tensor: loading tensor blk.17.ffn_gate.weight create_tensor: loading tensor blk.17.ffn_down.weight create_tensor: loading tensor blk.17.ffn_up.weight create_tensor: loading tensor blk.18.attn_norm.weight create_tensor: loading tensor blk.18.attn_q.weight create_tensor: loading tensor blk.18.attn_k.weight create_tensor: loading tensor blk.18.attn_v.weight create_tensor: loading tensor blk.18.attn_output.weight create_tensor: loading tensor blk.18.attn_q.bias create_tensor: loading tensor blk.18.attn_k.bias create_tensor: loading tensor blk.18.attn_v.bias create_tensor: loading tensor blk.18.ffn_norm.weight create_tensor: loading tensor blk.18.ffn_gate.weight create_tensor: loading tensor blk.18.ffn_down.weight create_tensor: loading tensor blk.18.ffn_up.weight create_tensor: loading tensor blk.19.attn_norm.weight create_tensor: loading tensor blk.19.attn_q.weight create_tensor: loading tensor blk.19.attn_k.weight create_tensor: loading tensor blk.19.attn_v.weight create_tensor: loading tensor blk.19.attn_output.weight create_tensor: loading tensor blk.19.attn_q.bias create_tensor: loading tensor blk.19.attn_k.bias create_tensor: 
loading tensor blk.19.attn_v.bias create_tensor: loading tensor blk.19.ffn_norm.weight create_tensor: loading tensor blk.19.ffn_gate.weight create_tensor: loading tensor blk.19.ffn_down.weight create_tensor: loading tensor blk.19.ffn_up.weight create_tensor: loading tensor blk.20.attn_norm.weight create_tensor: loading tensor blk.20.attn_q.weight create_tensor: loading tensor blk.20.attn_k.weight create_tensor: loading tensor blk.20.attn_v.weight create_tensor: loading tensor blk.20.attn_output.weight create_tensor: loading tensor blk.20.attn_q.bias create_tensor: loading tensor blk.20.attn_k.bias create_tensor: loading tensor blk.20.attn_v.bias create_tensor: loading tensor blk.20.ffn_norm.weight create_tensor: loading tensor blk.20.ffn_gate.weight create_tensor: loading tensor blk.20.ffn_down.weight create_tensor: loading tensor blk.20.ffn_up.weight create_tensor: loading tensor blk.21.attn_norm.weight create_tensor: loading tensor blk.21.attn_q.weight create_tensor: loading tensor blk.21.attn_k.weight create_tensor: loading tensor blk.21.attn_v.weight create_tensor: loading tensor blk.21.attn_output.weight create_tensor: loading tensor blk.21.attn_q.bias create_tensor: loading tensor blk.21.attn_k.bias create_tensor: loading tensor blk.21.attn_v.bias create_tensor: loading tensor blk.21.ffn_norm.weight create_tensor: loading tensor blk.21.ffn_gate.weight create_tensor: loading tensor blk.21.ffn_down.weight create_tensor: loading tensor blk.21.ffn_up.weight create_tensor: loading tensor blk.22.attn_norm.weight create_tensor: loading tensor blk.22.attn_q.weight create_tensor: loading tensor blk.22.attn_k.weight create_tensor: loading tensor blk.22.attn_v.weight create_tensor: loading tensor blk.22.attn_output.weight create_tensor: loading tensor blk.22.attn_q.bias create_tensor: loading tensor blk.22.attn_k.bias create_tensor: loading tensor blk.22.attn_v.bias create_tensor: loading tensor blk.22.ffn_norm.weight create_tensor: loading tensor blk.22.ffn_gate.weight create_tensor: loading tensor blk.22.ffn_down.weight create_tensor: loading tensor blk.22.ffn_up.weight create_tensor: loading tensor blk.23.attn_norm.weight create_tensor: loading tensor blk.23.attn_q.weight create_tensor: loading tensor blk.23.attn_k.weight create_tensor: loading tensor blk.23.attn_v.weight create_tensor: loading tensor blk.23.attn_output.weight create_tensor: loading tensor blk.23.attn_q.bias create_tensor: loading tensor blk.23.attn_k.bias create_tensor: loading tensor blk.23.attn_v.bias create_tensor: loading tensor blk.23.ffn_norm.weight create_tensor: loading tensor blk.23.ffn_gate.weight create_tensor: loading tensor blk.23.ffn_down.weight create_tensor: loading tensor blk.23.ffn_up.weight create_tensor: loading tensor blk.24.attn_norm.weight create_tensor: loading tensor blk.24.attn_q.weight create_tensor: loading tensor blk.24.attn_k.weight create_tensor: loading tensor blk.24.attn_v.weight create_tensor: loading tensor blk.24.attn_output.weight create_tensor: loading tensor blk.24.attn_q.bias create_tensor: loading tensor blk.24.attn_k.bias create_tensor: loading tensor blk.24.attn_v.bias create_tensor: loading tensor blk.24.ffn_norm.weight create_tensor: loading tensor blk.24.ffn_gate.weight create_tensor: loading tensor blk.24.ffn_down.weight create_tensor: loading tensor blk.24.ffn_up.weight create_tensor: loading tensor blk.25.attn_norm.weight create_tensor: loading tensor blk.25.attn_q.weight create_tensor: loading tensor blk.25.attn_k.weight create_tensor: loading tensor blk.25.attn_v.weight 
create_tensor: loading tensor blk.25.attn_output.weight create_tensor: loading tensor blk.25.attn_q.bias create_tensor: loading tensor blk.25.attn_k.bias create_tensor: loading tensor blk.25.attn_v.bias create_tensor: loading tensor blk.25.ffn_norm.weight create_tensor: loading tensor blk.25.ffn_gate.weight create_tensor: loading tensor blk.25.ffn_down.weight create_tensor: loading tensor blk.25.ffn_up.weight create_tensor: loading tensor blk.26.attn_norm.weight create_tensor: loading tensor blk.26.attn_q.weight create_tensor: loading tensor blk.26.attn_k.weight create_tensor: loading tensor blk.26.attn_v.weight create_tensor: loading tensor blk.26.attn_output.weight create_tensor: loading tensor blk.26.attn_q.bias create_tensor: loading tensor blk.26.attn_k.bias create_tensor: loading tensor blk.26.attn_v.bias create_tensor: loading tensor blk.26.ffn_norm.weight create_tensor: loading tensor blk.26.ffn_gate.weight create_tensor: loading tensor blk.26.ffn_down.weight create_tensor: loading tensor blk.26.ffn_up.weight create_tensor: loading tensor blk.27.attn_norm.weight create_tensor: loading tensor blk.27.attn_q.weight create_tensor: loading tensor blk.27.attn_k.weight create_tensor: loading tensor blk.27.attn_v.weight create_tensor: loading tensor blk.27.attn_output.weight create_tensor: loading tensor blk.27.attn_q.bias create_tensor: loading tensor blk.27.attn_k.bias create_tensor: loading tensor blk.27.attn_v.bias create_tensor: loading tensor blk.27.ffn_norm.weight create_tensor: loading tensor blk.27.ffn_gate.weight create_tensor: loading tensor blk.27.ffn_down.weight create_tensor: loading tensor blk.27.ffn_up.weight load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead load_tensors: offloading 28 repeating layers to GPU load_tensors: offloading output layer to GPU load_tensors: offloaded 29/29 layers to GPU load_tensors: CPU_Mapped model buffer size = 125.19 MiB load_tensors: CUDA0 model buffer size = 934.70 MiB time=2025-12-11T16:46:51.789+08:00 level=DEBUG source=server.go:1338 msg="model load progress 0.58" llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 4096 llama_context: n_ctx_seq = 4096 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 1 llama_context: flash_attn = disabled llama_context: kv_unified = false llama_context: freq_base = 10000.0 llama_context: freq_scale = 1 llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized set_abort_callback: call llama_context: CUDA_Host output buffer size = 0.59 MiB llama_kv_cache: layer 0: dev = CUDA0 llama_kv_cache: layer 1: dev = CUDA0 llama_kv_cache: layer 2: dev = CUDA0 llama_kv_cache: layer 3: dev = CUDA0 llama_kv_cache: layer 4: dev = CUDA0 llama_kv_cache: layer 5: dev = CUDA0 llama_kv_cache: layer 6: dev = CUDA0 llama_kv_cache: layer 7: dev = CUDA0 llama_kv_cache: layer 8: dev = CUDA0 llama_kv_cache: layer 9: dev = CUDA0 llama_kv_cache: layer 10: dev = CUDA0 llama_kv_cache: layer 11: dev = CUDA0 llama_kv_cache: layer 12: dev = CUDA0 llama_kv_cache: layer 13: dev = CUDA0 llama_kv_cache: layer 14: dev = CUDA0 llama_kv_cache: layer 15: dev = CUDA0 llama_kv_cache: layer 16: dev = CUDA0 llama_kv_cache: layer 17: dev = CUDA0 llama_kv_cache: layer 18: dev = CUDA0 llama_kv_cache: layer 19: dev = CUDA0 llama_kv_cache: layer 20: dev = CUDA0 llama_kv_cache: layer 21: dev = CUDA0 llama_kv_cache: layer 22: dev = CUDA0 
llama_kv_cache: layer 23: dev = CUDA0 llama_kv_cache: layer 24: dev = CUDA0 llama_kv_cache: layer 25: dev = CUDA0 llama_kv_cache: layer 26: dev = CUDA0 llama_kv_cache: layer 27: dev = CUDA0 llama_kv_cache: CUDA0 KV buffer size = 112.00 MiB llama_kv_cache: size = 112.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 56.00 MiB, V (f16): 56.00 MiB llama_context: enumerating backends llama_context: backend_ptrs.size() = 2 llama_context: max_nodes = 2712 llama_context: reserving full memory module llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1 graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512 graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1 time=2025-12-11T16:46:52.040+08:00 level=DEBUG source=server.go:1338 msg="model load progress 1.00" graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512 llama_context: CUDA0 compute buffer size = 299.75 MiB llama_context: CUDA_Host compute buffer size = 12.01 MiB llama_context: graph nodes = 1098 llama_context: graph splits = 2 time=2025-12-11T16:46:52.290+08:00 level=INFO source=server.go:1332 msg="llama runner started in 0.82 seconds" time=2025-12-11T16:46:52.290+08:00 level=INFO source=sched.go:517 msg="loaded runners" count=1 time=2025-12-11T16:46:52.290+08:00 level=INFO source=server.go:1294 msg="waiting for llama runner to start responding" time=2025-12-11T16:46:52.290+08:00 level=INFO source=server.go:1332 msg="llama runner started in 0.82 seconds" time=2025-12-11T16:46:52.290+08:00 level=DEBUG source=sched.go:529 msg="finished setting up" runner.name=registry.ollama.ai/library/deepseek-r1:1.5b runner.inference="[{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA}]" runner.size="1.3 GiB" runner.vram="1.3 GiB" runner.parallel=1 runner.pid=1241307 runner.model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc runner.num_ctx=4096 time=2025-12-11T16:46:52.291+08:00 level=DEBUG source=server.go:1465 msg="completion request" images=0 prompt=104 format="" time=2025-12-11T16:46:52.291+08:00 level=TRACE source=server.go:1466 msg="completion request" prompt="<|User|>你好,请问你是谁?\n\n你好,请问你是谁?<|Assistant|><think>\n\n</think>\n\n" time=2025-12-11T16:46:52.293+08:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=19 used=0 remaining=19 [GIN] 2025/12/11 - 16:46:52 | 200 | 1.172291934s | 127.0.0.1 | POST "/api/chat" time=2025-12-11T16:46:52.403+08:00 level=DEBUG source=sched.go:537 msg="context for request finished" time=2025-12-11T16:46:52.403+08:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/deepseek-r1:1.5b runner.inference="[{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA}]" runner.size="1.3 GiB" runner.vram="1.3 GiB" runner.parallel=1 runner.pid=1241307 runner.model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc runner.num_ctx=4096 duration=5m0s time=2025-12-11T16:46:52.403+08:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/deepseek-r1:1.5b runner.inference="[{ID:GPU-5bfceebe-7c3b-bae6-3a62-cc5002f849b8 Library:CUDA}]" runner.size="1.3 GiB" runner.vram="1.3 GiB" runner.parallel=1 runner.pid=1241307 
runner.model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc runner.num_ctx=4096 refCount=0 ^Ctime=2025-12-11T16:47:30.906+08:00 level=DEBUG source=sched.go:844 msg="shutting down runner" model=/home/amekiri/.ollama/models/blobs/sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc time=2025-12-11T16:47:30.906+08:00 level=DEBUG source=sched.go:136 msg="shutting down scheduler pending loop" time=2025-12-11T16:47:30.906+08:00 level=DEBUG source=sched.go:269 msg="shutting down scheduler completed loop" time=2025-12-11T16:47:30.922+08:00 level=DEBUG source=server.go:1755 msg="stopping llama server" pid=1241307 time=2025-12-11T16:47:30.922+08:00 level=DEBUG source=server.go:1761 msg="waiting for llama server to exit" pid=1241307 time=2025-12-11T16:47:31.165+08:00 level=DEBUG source=server.go:1765 msg="llama server stopped" pid=1241307 ``` I found Ollama works on GPU when I use `ollama serve` to launch it instead using systemd. When I try to use `systemctl start ollama`, I found some errors: ``` ➜ ~ systemctl status ollama ● ollama.service - Ollama Service Loaded: loaded (/usr/lib/systemd/system/ollama.service; enabled; preset: disabled) Active: active (running) since Thu 2025-12-11 16:48:42 CST; 4s ago Invocation: 9917834945de4488b7a7143f1a23bb43 Main PID: 1252847 (ollama) Tasks: 13 (limit: 73914) Memory: 14.2M (peak: 286.6M) CPU: 334ms CGroup: /system.slice/ollama.service └─1252847 /usr/bin/ollama serve Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.589+08:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0" Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.589+08:00 level=INFO source=routes.go:1597 msg="Listening on 127.0.0.1:11434 (version 0 .13.2)" Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.589+08:00 level=INFO source=runner.go:67 msg="discovering available GPUs..." Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.590+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama ru nner --ollama-engine --port 44287" Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.670+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama ru nner --ollama-engine --port 37077" Dec 11 16:48:42 amekiri ollama[1252847]: time=2025-12-11T16:48:42.670+08:00 level=INFO source=server.go:392 msg="starting runner" cmd="/usr/bin/ollama ru nner --ollama-engine --port 35601" Dec 11 16:48:43 amekiri systemd-coredump[1253010]: [🡕] Process 1252958 (ollama) of user 963 dumped core. 
Stack trace of thread 1252971: #0 0x00007f7a39a9890c n/a (libc.so.6 + 0x9890c) #1 0x00007f7a39a3e3a0 raise (libc.so.6 + 0x3e3a0) #2 0x00007f7a39a2557a abort (libc.so.6 + 0x2557a) #3 0x00007f770234d6f5 rocblas_abort (librocblas.so.5 + 0x954d6f5) #4 0x00007f770219711e _ZN12_GLOBAL__N_123get_library_and_adapterEPSt10shared_ptrIN7Tensile21M asterSolutionLibraryINS1_18ContractionProblemENS1_19ContractionSolutionEEEEPS0_I20hipDeviceProp_tR0600Ei (librocblas.so.5 + 0x939711e) #5 0x00007f773fa8b532 n/a (libggml-hip.so + 0x3ce8b532) #6 0x00007f773fa8c1b3 ggml_backend_cuda_reg (libggml-hip.so + 0x3ce8c1b3) #7 0x00005574019d16fb n/a (/usr/bin/ollama + 0xe1b6fb) #8 0x00005574019cf642 n/a (/usr/bin/ollama + 0xe19642) #9 0x00005574019d0a7c n/a (/usr/bin/ollama + 0xe1aa7c) #10 0x0000557400c7ae21 n/a (/usr/bin/ollama + 0xc4e21) ELF object binary architecture: AMD x86-64 Dec 11 16:48:43 amekiri ollama[1252847]: time=2025-12-11T16:48:43.581+08:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIB RARY_PATH=[/usr/lib/ollama] extra_envs="map[GGML_CUDA_INIT:1 ROCR_VISIBLE_DEVICES:0]" error="runner crashed" Dec 11 16:48:43 amekiri ollama[1252847]: time=2025-12-11T16:48:43.581+08:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-5bfceebe-7c3b-ba e6-3a62-cc5002f849b8 filter_id="" library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5070 Ti" libdirs=ollama driver=13.0 pci_id=0000:01 :00.0 type=discrete total="15.9 GiB" available="13.1 GiB" Dec 11 16:48:43 amekiri ollama[1252847]: time=2025-12-11T16:48:43.581+08:00 level=INFO source=routes.go:1638 msg="entering low vram mode" "total vram"="1 5.9 GiB" threshold="20.0 GiB" ```

@LiaoZhanHao commented on GitHub (Dec 11, 2025):

I also encountered this problem. It was solved by downgrading Ollama to version 0.12.10.

<!-- gh-comment-id:3641255550 -->
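For reference, pinning a specific Ollama version with the official install script looks like the sketch below (the OLLAMA_VERSION variable is the same mechanism shown later in this thread; it does not apply to distro packages such as the Arch one used in this report, where the package itself would need to be downgraded):

```
# Reinstall a pinned Ollama version via the official install script
# (assumes a script-based install; Arch users would downgrade the distro package instead)
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.10 sh
```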

@dhiltgen commented on GitHub (Dec 11, 2025):

@amekiri13 the logs you shared from systemd look "normal" for your hardware. You have an iGPU, which ROCm does not support. During GPU discovery, we force ROCm to do a deeper initialization so we can trigger ROCm to crash on unsupported devices and then eliminate them from the set of inference devices we use at runtime. Your gfx1036 device is not included in the list of "inference compute", as expected, and it correctly lists your RTX 5070 Ti. Based on those startup logs, it should run inference on the GPU.

Based on the paths in your logs, it appears you aren't running an official Ollama binary but one built and packaged by someone else, and there may be other bugs related to that packaging. Can you share a systemd log, with OLLAMA_DEBUG=2 set, covering Ollama startup through model load where it doesn't use the GPU as you describe?

@LiaoZhanHao please share your server startup logs up to the point of "inference compute" with OLLAMA_DEBUG="2" set so we can see what may be going wrong.

<!-- gh-comment-id:3643630814 -->
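For anyone needing to produce the requested log, a minimal sketch of enabling debug output for the systemd-managed service and capturing startup through model load (this uses a standard systemd drop-in override and journalctl; nothing here is Ollama-specific beyond the OLLAMA_DEBUG variable):

```
# Add OLLAMA_DEBUG=2 to the service environment via a drop-in override
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=2"

# Restart the service, reproduce the problem (load a model), then capture the log
sudo systemctl restart ollama.service
journalctl -u ollama.service -b --no-pager > ollama-debug.log
```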
Author
Owner

@liorgross commented on GitHub (Dec 17, 2025):

Same issue here; I noticed it started with version 0.13.0.

I was on version 0.12.7, upgraded yesterday to version 0.13.4 and noticed that all models run on CPU:

![Image](https://github.com/user-attachments/assets/602e4276-5deb-4b93-b56c-79609ae2834a)

I rolled back to 0.12.11 and the same models ran on GPU again:

![Image](https://github.com/user-attachments/assets/94ac7b4b-42ab-47ba-b9b5-f3f40c3ca8dd)

This morning I tried version 0.13.0 and it was also CPU-only, so I rolled back again to 0.12.11.

I am using an AMD Ryzen AI Max+ 395 (gfx1151) with ROCm version 7.9.0 (it's a preview release, the only one with support for this board: https://github.com/ROCm/ROCm/releases/tag/therock-7.9.0).

Whatever it is, it started with the 0.13 update. I'd be happy to provide more information and logs. A guess would be the change related to "Improved VRAM information detection for AMD GPUs".
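
For reference, a quick way to double-check what the ROCm stack itself enumerates on a setup like this, as a sketch assuming the ROCm tools (`rocminfo`, `rocm-smi`) are on the PATH:

```
# The APU should be listed as an agent with target gfx1151
rocminfo | grep -i "gfx"

# rocm-smi can also show how memory is split between dedicated VRAM and GTT
rocm-smi --showmeminfo vram gtt
```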


@dhiltgen commented on GitHub (Dec 17, 2025):

@liorgross are you compiling from source? Ollama's official binaries are linked against ROCm v6.3 on Linux. There are some other issues in the backlog related to ROCm updates, which may be a better place to share your findings.


@liorgross commented on GitHub (Dec 17, 2025):

@dhiltgen I was using the official installer script: `curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.13.0 sh`


@dhiltgen commented on GitHub (Dec 17, 2025):

@liorgross can you share your server log with OLLAMA_DEBUG=2 set, showing the initial startup through "inference compute"?


@liorgross commented on GitHub (Dec 17, 2025):

@dhiltgen providing logs from both version 0.12.11 (where it works great) and version 0.13.4, which has the issue:

[ollama_12.11.txt](https://github.com/user-attachments/files/24220230/ollama_12.11.txt)
[ollama_13.4.txt](https://github.com/user-attachments/files/24220231/ollama_13.4.txt)

From a quick look, I can see that it does not correctly detect the available memory.

Line 203 in the working version:

`time=2025-12-17T12:58:43.650-06:00 level=INFO source=types.go:42 msg="inference compute" id=0 filter_id=0 library=ROCm compute=gfx1151 name=ROCm0 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=60342.13 pci_id=0000:c6:00.0 type=iGPU total="110.0 GiB" available="109.8 GiB"`

Line 416 in version 0.13.4:

`time=2025-12-17T13:03:18.435-06:00 level=INFO source=types.go:42 msg="inference compute" id=0 filter_id=0 library=ROCm compute=gfx1151 name=ROCm0 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=60342.13 pci_id=0000:c6:00.0 type=iGPU total="512.0 MiB" available="364.2 MiB"`


@dhiltgen commented on GitHub (Dec 17, 2025):

@liorgross it looks like it does discover your iGPU, but it does not have much dedicated VRAM, and we don't currently handle GTT properly. See #13196
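
For context on those two numbers, the dedicated-VRAM vs. GTT split that the amdgpu driver reports can be read straight from sysfs. A rough sketch (the card index may differ on multi-GPU machines; values are in bytes):

```
# Dedicated VRAM carve-out for the iGPU (roughly the 512 MiB that 0.13.x reports)
cat /sys/class/drm/card0/device/mem_info_vram_total

# GTT, i.e. system memory the iGPU can map (roughly the ~110 GiB that 0.12.x reported)
cat /sys/class/drm/card0/device/mem_info_gtt_total
```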


@liorgross commented on GitHub (Dec 18, 2025):

@dhiltgen That does seem to be the problem...

Also, I checked out that branch, and it still seems to run on the CPU:

![Image](https://github.com/user-attachments/assets/bacd2b55-13d2-4d69-b053-94d4994d384c)

[ollama_debug.log](https://github.com/user-attachments/files/24224222/ollama_debug.log)


@dhiltgen commented on GitHub (Dec 18, 2025):

@liorgross thanks for giving it a try. From your logs, it looks like the ROCm ggml library may not have been built, leading to CPU-only support.
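
One quick way to check whether the ROCm backend actually made it into an install is to look for the `libggml-hip` shared object under the library directory that shows up in the server logs earlier in this thread. A sketch assuming the default `/usr/lib/ollama` location (a source build will put it under its own install prefix instead):

```
# A ROCm-enabled install should contain a libggml-hip shared object;
# if nothing turns up, the server falls back to the CPU backend only
find /usr/lib/ollama -name 'libggml-hip*' 2>/dev/null
```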


@liorgross commented on GitHub (Dec 18, 2025):

@dhiltgen This was the first time I compiled it manually from source… I'll check if there's something I missed.


@liorgross commented on GitHub (Jan 11, 2026):

@dhiltgen I tried the latest pre-release version, 0.14.0-rc2, and it works - issue resolved!

Reference: github-starred/ollama#55377