[GH-ISSUE #12342] Bug: inconsistent use of VRAM and GTT on the iGPU of AMD Ryzen AI processors #33958

Closed
opened 2026-04-22 17:08:35 -05:00 by GiteaMirror · 30 comments

Originally created by @alexhegit on GitHub (Sep 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12342

What is the issue?

Ollama does not use the VRAM and GTT of the AMD Ryzen AI processor's iGPU with consistent logic.

The iGPU of an AMD Ryzen AI processor has two memory pools: the VRAM size is set in the BIOS, and half of the remaining system memory becomes GTT.

e.g. for the 128GB DDR memory of an AMD Ryzen AI Max+ laptop - ROG Flow Z13 (2025) GZ302, the split works out as follows (sketched in code below):

  1. if VRAM is set to 96GB, then GTT = (128-96)/2 = 16GB
  2. if VRAM is set to 8GB, then GTT = (128-8)/2 = 60GB
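For reference, a minimal sketch of this split in Go (just the arithmetic above, not code from Ollama):

```go
package main

import "fmt"

// gttSize returns the GTT size implied by the BIOS VRAM carve-out,
// per the rule above: half of the remaining system memory becomes GTT.
func gttSize(totalGB, vramGB int) int {
	return (totalGB - vramGB) / 2
}

func main() {
	fmt.Println(gttSize(128, 96)) // 16
	fmt.Println(gttSize(128, 8))  // 60
}
```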

Test platform:

Hardware: AMD Ryzen AI Max+ 395 laptop - ROG Flow Z13 (2025) GZ302, with 96GB VRAM set in the BIOS

OS: Ubuntu 24.04
Ollama: 0.11.11

Test 1:

VRAM=96GB, GTT=16GB

```bash
ollama run gpt-oss:20b --verbose "why is sky blue"
```

Use `radeontop` to monitor the VRAM usage.

The model gpt-oss:20b is loaded into GTT (16GB) rather than VRAM (96GB).

![Image](https://github.com/user-attachments/assets/98606d1a-db8c-4854-9ae4-fa8f2e761318)

The model is expected to load into VRAM.

Other tests:

Running qwen3:32b fails:

```terminal
$ ollama run qwen3:32b --verbose "why is sky blue"
Error: 500 Internal Server Error: llama runner process has terminated: error loading model: unable to allocate ROCm0 buffer
llama_model_load_from_file_impl: failed to load model
```

The model is expected to load into VRAM.

Relevant log output


OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.11.11

GiteaMirror added the needs more info, bug labels 2026-04-22 17:08:35 -05:00

@alexhegit commented on GitHub (Sep 19, 2025):

Test 2:
VRAM=8GB, GTT=60GB

Then I reset the VRAM to 8GB (so GTT is 60GB) and ran gpt-oss:20b; the GTT is large enough to load the full model.

But `ollama ps` shows it running in GPU/CPU hybrid mode. It seems that Ollama uses the VRAM size to estimate the memory footprint when deciding whether to run the model on the CPU, the GPU, or both, but loads the model into GTT at runtime. That means Ollama has a bug: it does not use consistent logic between the memory-footprint estimate and the real runtime memory usage.

![Image](https://github.com/user-attachments/assets/cadc949c-29e6-4810-a8e3-4140bdb61043)

@alexhegit commented on GitHub (Sep 19, 2025):

Expected logic:

If GTT is chosen for loading the model, then GTT should also be used to estimate the memory footprint when deciding whether to run in CPU, GPU, or CPU+GPU mode (see the sketch after this list).

Or, more simply, use VRAM for loading the model, since:

  1. The BIOS has an option to set the VRAM for the iGPU; it can be 96GB on an AMD Ryzen AI Max+ 395 laptop - ROG Flow Z13 (2025) GZ302. That means it could run gpt-oss-120b in GPU mode.
  2. Using VRAM makes the iGPU and dGPU follow the same logic in Ollama.
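A minimal sketch of the consistency argument above (hypothetical names, not Ollama's actual code): whichever pool the runner will actually allocate from should also be the pool used for the footprint estimate.

```go
package main

import "fmt"

// gpuMemory models the two pools of an AMD APU; illustrative only.
type gpuMemory struct {
	vramFree uint64 // BIOS carve-out pool
	gttFree  uint64 // kernel-managed GTT pool
}

// budget returns the pool the runtime will allocate from, so the
// CPU/GPU/hybrid decision is judged against the same number as the
// real load.
func budget(m gpuMemory, loadsIntoGTT bool) uint64 {
	if loadsIntoGTT {
		return m.gttFree
	}
	return m.vramFree
}

func main() {
	m := gpuMemory{vramFree: 8 << 30, gttFree: 60 << 30} // Test 2 setup
	fmt.Printf("budget: %d GiB\n", budget(m, true)>>30)  // 60 GiB
}
```

Under Test 2 (VRAM=8GB, GTT=60GB), estimating against GTT would correctly conclude that gpt-oss:20b fits entirely on the GPU instead of falling back to hybrid mode.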

@rick-github commented on GitHub (Sep 19, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may help in debugging.

I have an evo-x2 (AMD Ryzen AI Max+ 395) with kernel 6.11.0-29-generic; with 96GB VRAM set in the BIOS, ollama loads models into VRAM.


@Ricky1975 commented on GitHub (Sep 20, 2025):

I have the same issue with an M890 Pro Mini PC (AMD Ryzen 9 8945HS w/ Radeon 780M Graphics)
Kernel 6.1.0-39-amd64, ROCk module version 6.12.12

I think the issue is:
When Ollama starts up, it uses the amount of VRAM to calculate the layers to push to the GPU. When loading, it pushes the layers to the GTT.
Pushing them to GTT seems correct to me (and when I force more layers to be pushed there, it works until my GTT is full).
If I minimize my VRAM, Ollama does not even allow me to use the GTT.

Suspected solution (from a n00b):
The calculation of the available GPU-usable memory should take VRAM or GTT into account, not only VRAM. A possible implementation might be an environment parameter to switch or to override (e.g. OLLAMA_USE_GTT=true or OLLAMA_VRAM_OVERRIDE=60G); a parsing sketch follows.
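For illustration, a minimal sketch of parsing the proposed OLLAMA_VRAM_OVERRIDE value; the variable is the suggestion above and parseSize is a hypothetical helper, not an existing Ollama feature:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseSize converts a value like "60G" into bytes. Hypothetical helper
// for the proposed OLLAMA_VRAM_OVERRIDE variable; not part of Ollama.
func parseSize(s string) (uint64, error) {
	s = strings.ToUpper(strings.TrimSpace(s))
	mult := uint64(1)
	if strings.HasSuffix(s, "G") {
		mult = 1 << 30
		s = strings.TrimSuffix(s, "G")
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * mult, nil
}

func main() {
	if v := os.Getenv("OLLAMA_VRAM_OVERRIDE"); v != "" {
		if b, err := parseSize(v); err == nil {
			fmt.Printf("overriding detected GPU memory to %d bytes\n", b)
		}
	}
}
```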


@rick-github commented on GitHub (Sep 20, 2025):

As mentioned, it works fine for me. Useful information to debug this might be found in the server logs.


@alexhegit commented on GitHub (Sep 20, 2025):

> I have the same issue with an M890 Pro Mini PC (AMD Ryzen 9 8945HS w/ Radeon 780M Graphics) Kernel 6.1.0-39-amd64, ROCk module version 6.12.12
>
> I think the issue is: When Ollama starts up, it uses the amount of VRAM to calculate the layers to push to the GPU. When loading, it pushes the layers to the GTT. Pushing them to GTT seems correct to me (and when I force more layers to be pushed there, it works until my GTT is full). If I minimize my VRAM, Ollama does not even allow me to use the GTT.
>
> Suspected solution (from a n00b): The calculation of the available GPU-usable memory should take VRAM or GTT into account, not only VRAM. A possible implementation might be an environment parameter to switch or to override (e.g. OLLAMA_USE_GTT=true or OLLAMA_VRAM_OVERRIDE=60G)

Yes, we have the same issue and the same expectation.


@rick-github commented on GitHub (Sep 20, 2025):

As mentioned, it works fine for me. Useful information to debug this might be found in the server logs.


@alexhegit commented on GitHub (Sep 20, 2025):

> [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may help in debugging.
>
> I have an evo-x2 (AMD Ryzen AI Max+ 395) with kernel 6.11.0-29-generic; with 96GB VRAM set in the BIOS, ollama loads models into VRAM.

Did you use the same Ollama version 0.11.11 to run gpt-oss:20b? Please use `ollama ps` to see whether it runs at 100% CPU rather than 100% GPU.

The current logic of Ollama uses VRAM to judge whether the GPU memory is enough to load the model, but loads and runs it in GTT.


@alexhegit commented on GitHub (Sep 20, 2025):

There is a fork repo trying to solve this issue for AMD APUs (with iGPU): https://github.com/rjmalagon/ollama-linux-amd-apu

It adds a new path that uses GTT for AMD APUs in https://github.com/rjmalagon/ollama-linux-amd-apu/blob/main/discover/amd_linux.go

![Image](https://github.com/user-attachments/assets/90c1f992-4078-45fa-94ab-7290cfca0cea)
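For context, the amdgpu kernel driver exposes the GTT pool through sysfs next to the VRAM counters; below is a minimal sketch of reading it in the spirit of that fork (assuming the standard mem_info_gtt_total / mem_info_gtt_used nodes; this is not the fork's actual code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readSysfsUint reads a single integer from an amdgpu sysfs file,
// e.g. /sys/class/drm/card1/device/mem_info_gtt_total.
func readSysfsUint(path string) (uint64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	dev := "/sys/class/drm/card1/device" // card index varies per system
	total, err := readSysfsUint(dev + "/mem_info_gtt_total")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	used, _ := readSysfsUint(dev + "/mem_info_gtt_used")
	fmt.Printf("GTT total=%d used=%d free=%d\n", total, used, total-used)
}
```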

@rick-github commented on GitHub (Sep 20, 2025):

```console
$ ollama -v ; ollama run gpt-oss:20b hello ; ollama ps
ollama version is 0.11.11
Thinking...
We have a conversation. The user says "hello". We need to respond. The instructions: "You are ChatGPT, a large language model trained by OpenAI." No special instruction. We should greet, ask how can help.
Let's respond politely.
...done thinking.

Hello! 👋 How can I help you today?

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    aa4295ac10c3    14 GB    100% GPU     8192       Forever
```

Useful information to debug this might be found in the server logs.


@alexhegit commented on GitHub (Sep 21, 2025):

> ollama -v ; ollama run gpt-oss:20b hello ; ollama ps

Did you use radeontop to monitor where the model is loaded, GTT or VRAM?

My test shows it uses VRAM to estimate the model's memory footprint but uses GTT to load and run the model.

1. Memory setting: VRAM=16GB, GTT=56GB

2. OS:

```shell
alex@GZ302EA:~$ uname -a
Linux GZ302EA 6.14.0-24-generic #24~24.04.3-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul  7 16:39:17 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
```

3. Test command:

```shell
ollama -v ; ollama run gpt-oss:20b hello ; ollama ps
```

![Image](https://github.com/user-attachments/assets/753b8c0f-91c9-4e17-940e-0344086f9211)

@rick-github commented on GitHub (Sep 21, 2025):

```
             12979M / 98140M VRAM  13.22% │
                 14M / 15848M GTT   0.09% │
       1.00G / 1.00G Memory Clock 100.00% │
       0.60G / 2.90G Shader Clock  20.69% │
```

Useful information to debug this might be found in the server logs.


@alexhegit commented on GitHub (Sep 21, 2025):

> ```
>              12979M / 98140M VRAM  13.22% │
>                  14M / 15848M GTT   0.09% │
>        1.00G / 1.00G Memory Clock 100.00% │
>        0.60G / 2.90G Shader Clock  20.69% │
> ```
>
> Useful information to debug this might be found in the server logs.

Interesting. I cannot explain the different results between us. We are using the same Ollama and the same model in the same test cases.

I re-set the VRAM to 96GB in the BIOS and tested again. It still loads the model into GTT, as monitored by radeontop.

The log shows VRAM=96GB and gpt-oss:20b running on the AMD iGPU gfx1151.

```terminal
Sep 21 10:05:47 GZ302EA systemd[1]: Stopping ollama.service - Ollama Service...
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=sched.go:265 msg="shutting down scheduler completed loop"
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=sched.go:766 msg="shutting down runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=server.go:1683 msg="stopping llama server" pid=55289
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=sched.go:135 msg="shutting down scheduler pending loop"
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=server.go:1689 msg="waiting for llama server to exit" pid=55289
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.869+08:00 level=DEBUG source=server.go:1693 msg="llama server stopped" pid=55289
Sep 21 10:05:47 GZ302EA systemd[1]: ollama.service: Deactivated successfully.
Sep 21 10:05:47 GZ302EA systemd[1]: Stopped ollama.service - Ollama Service.
Sep 21 10:05:47 GZ302EA systemd[1]: ollama.service: Consumed 1min 24.998s CPU time, 5.3G memory peak, 210.7M memory swap peak.
Sep 21 10:05:56 GZ302EA systemd[1]: Started ollama.service - Ollama Service.
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.187+08:00 level=INFO source=routes.go:1332 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[chrome-extension://* moz-extension://* safari-web-extension://* ollama serve http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.189+08:00 level=INFO source=images.go:477 msg="total blobs: 88"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.190+08:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.190+08:00 level=INFO source=routes.go:1385 msg="Listening on [::]:11434 (version 0.11.11)"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.190+08:00 level=DEBUG source=sched.go:121 msg="starting llm scheduler"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.190+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.191+08:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.191+08:00 level=DEBUG source=gpu.go:512 msg="Searching for GPU library" name=libcuda.so*
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.191+08:00 level=DEBUG source=gpu.go:536 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.199+08:00 level=DEBUG source=gpu.go:569 msg="discovered GPU libraries" paths=[]
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.199+08:00 level=DEBUG source=gpu.go:512 msg="Searching for GPU library" name=libcudart.so*
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.199+08:00 level=DEBUG source=gpu.go:536 msg="gpu library search" globs="[/usr/local/lib/ollama/libcudart.so* /libcudart.so* /usr/local/lib/ollama/cuda_v*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.200+08:00 level=DEBUG source=gpu.go:569 msg="discovered GPU libraries" paths="[/usr/local/lib/ollama/cuda_v12/libcudart.so.12.8.90 /usr/local/lib/ollama/cuda_v13/libcudart.so.13.0.88]"
Sep 21 10:05:56 GZ302EA ollama[57854]: cudaSetDevice err: 35
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.201+08:00 level=DEBUG source=gpu.go:585 msg="Unable to load cudart library /usr/local/lib/ollama/cuda_v12/libcudart.so.12.8.90: your nvidia driver is too old or missing.  If you have a CUDA GPU please upgrade to run ollama"
Sep 21 10:05:56 GZ302EA ollama[57854]: cudaSetDevice err: 35
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.201+08:00 level=DEBUG source=gpu.go:585 msg="Unable to load cudart library /usr/local/lib/ollama/cuda_v13/libcudart.so.13.0.88: your nvidia driver is too old or missing.  If you have a CUDA GPU please upgrade to run ollama"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.201+08:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/download/linux-drivers.html" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/0/properties"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:122 msg="detected CPU /sys/class/kfd/kfd/topology/nodes/0/properties"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/1/properties"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:203 msg="mapping amdgpu to drm sysfs nodes" amdgpu=/sys/class/kfd/kfd/topology/nodes/1/properties vendor=4098 device=5510 unique_id=0
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:237 msg=matched amdgpu=/sys/class/kfd/kfd/topology/nodes/1/properties drm=/sys/class/drm/card1/device
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:343 msg="amdgpu memory" gpu=0 total="96.0 GiB"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:344 msg="amdgpu memory" gpu=0 available="95.7 GiB"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /usr/local/lib/ollama/rocm"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_common.go:44 msg="detected ROCM next to ollama executable /usr/local/lib/ollama/rocm"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.205+08:00 level=DEBUG source=amd_linux.go:375 msg="rocm supported GPUs" types="[gfx1010 gfx1012 gfx1030 gfx1100 gfx1101 gfx1102 gfx1151 gfx1200 gfx1201 gfx900 gfx906 gfx908 gfx90a gfx942]"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.205+08:00 level=INFO source=amd_linux.go:390 msg="amdgpu is supported" gpu=0 gpu_type=gfx1151
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.205+08:00 level=INFO source=types.go:131 msg="inference compute" id=0 library=rocm variant="" compute=gfx1151 driver=0.0 name=1002:1586 total="96.0 GiB" available="95.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:03 | 200 |      67.331µs |       127.0.0.1 | GET      "/api/version"
Sep 21 10:06:03 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:03 | 200 |      23.062µs |       127.0.0.1 | HEAD     "/"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.617+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:03 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:03 | 200 |   56.855199ms |       127.0.0.1 | POST     "/api/show"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.697+08:00 level=DEBUG source=gpu.go:402 msg="updating system memory data" before.total="31.0 GiB" before.free="29.3 GiB" before.free_swap="7.7 GiB" now.total="31.0 GiB" now.free="29.5 GiB" now.free_swap="7.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.697+08:00 level=DEBUG source=amd_linux.go:492 msg="updating rocm free memory" gpu=0 name=1002:1586 before="95.7 GiB" now="95.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.697+08:00 level=DEBUG source=sched.go:188 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.714+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.715+08:00 level=DEBUG source=sched.go:208 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.784+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.784+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=gpu.go:402 msg="updating system memory data" before.total="31.0 GiB" before.free="29.5 GiB" before.free_swap="7.7 GiB" now.total="31.0 GiB" now.free="29.5 GiB" now.free_swap="7.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=amd_linux.go:492 msg="updating rocm free memory" gpu=0 name=1002:1586 before="95.7 GiB" now="95.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:200 msg="model wants flash attention"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:217 msg="enabling flash attention"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=server.go:324 msg="adding gpu library" path=/usr/local/lib/ollama/rocm
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=server.go:332 msg="adding gpu dependency paths" paths=[/usr/local/lib/ollama/rocm]
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 46445"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=server.go:400 msg=subprocess PATH=/home/alex/.local/bin:/home/alex/Android/Sdk:/home/alex/development/flutter/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin OLLAMA_HOST=0.0.0.0:11434 OLLAMA_ORIGINS="chrome-extension://*,moz-extension://*,safari-web-extension://* ollama serve" OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_QUEUE=512 OLLAMA_NUM_PARALLEL=4 OLLAMA_DEBUG=1 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/rocm LD_LIBRARY_PATH=/usr/local/lib/ollama/rocm:/usr/local/lib/ollama/rocm:/usr/local/lib/ollama:/usr/local/lib/ollama ROCR_VISIBLE_DEVICES=0
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:672 msg="loading model" "model layers"=25 requested=-1
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=gpu.go:402 msg="updating system memory data" before.total="31.0 GiB" before.free="29.5 GiB" before.free_swap="7.7 GiB" now.total="31.0 GiB" now.free="29.5 GiB" now.free_swap="7.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=amd_linux.go:492 msg="updating rocm free memory" gpu=0 name=1002:1586 before="95.7 GiB" now="95.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:678 msg="system memory" total="31.0 GiB" free="29.5 GiB" free_swap="7.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:686 msg="gpu memory" id=0 available="95.3 GiB" free="95.7 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.792+08:00 level=INFO source=runner.go:1254 msg="starting ollama engine"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.792+08:00 level=INFO source=runner.go:1289 msg="Server listening on 127.0.0.1:46445"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.797+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:fit LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.name default=""
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.description default=""
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=INFO source=ggml.go:131 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
Sep 21 10:06:04 GZ302EA ollama[57854]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 21 10:06:04 GZ302EA ollama[57854]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 21 10:06:04 GZ302EA ollama[57854]: ggml_cuda_init: found 1 ROCm devices:
Sep 21 10:06:04 GZ302EA ollama[57854]:   Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, ID: 0
Sep 21 10:06:04 GZ302EA ollama[57854]: load_backend: loaded ROCm backend from /usr/local/lib/ollama/libggml-hip.so
Sep 21 10:06:04 GZ302EA ollama[57854]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.609+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama/rocm
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.609+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.610+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.610+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.760+08:00 level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1325 splits=2
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:342 msg="total memory" size="14.1 GiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=1158266880U required.CPU.Graph=5898240U required.ROCm0.ID=0 required.ROCm0.Weights="[477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 1158278400U]" required.ROCm0.Cache="[34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 0U]" required.ROCm0.Graph=165415040U
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:894 msg="available gpu" id=0 "available layer vram"="95.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="157.8 MiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:728 msg="new layout created" layers="25[ID:0 Layers:25(0..24)]"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:alloc LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.791+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.317+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.317+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1325 splits=2
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:342 msg="total memory" size="14.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=1158266880A required.CPU.Graph=5898240A required.ROCm0.ID=0 required.ROCm0.Weights="[477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 1158278400A]" required.ROCm0.Cache="[34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 0U]" required.ROCm0.Graph=165415040A
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=DEBUG source=server.go:894 msg="available gpu" id=0 "available layer vram"="95.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="157.8 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=DEBUG source=server.go:728 msg="new layout created" layers="25[ID:0 Layers:25(0..24)]"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:commit LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:487 msg="offloading 24 repeating layers to GPU"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:493 msg="offloading output layer to GPU"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:342 msg="total memory" size="14.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:498 msg="offloaded 25/25 layers to GPU"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.738+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.09"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.989+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.18"
Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.239+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.24"
Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.495+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.33"
Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.746+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.35"
Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.997+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.45"
Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.248+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.54"
Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.498+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.64"
Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.749+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.73"
Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.000+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.81"
Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.251+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.88"
Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.502+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.91"
Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.752+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.95"
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.003+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.99"
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.104+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=INFO source=server.go:1289 msg="llama runner started in 5.47 seconds"
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=DEBUG source=sched.go:485 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=DEBUG source=server.go:1388 msg="completion request" images=0 prompt=307 format=""
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.280+08:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=68 used=0 remaining=68
Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 |  6.648798587s |       127.0.0.1 | POST     "/api/generate"
Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:493 msg="context for request finished"
Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:286 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 duration=5m0s
Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 refCount=0
Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 |      30.615µs |       127.0.0.1 | HEAD     "/"
Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 |     193.169µs |       127.0.0.1 | GET      "/api/ps"
```

34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 0U]" required.ROCm0.Graph=165415040U Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:894 msg="available gpu" id=0 "available layer vram"="95.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="157.8 MiB" Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:728 msg="new layout created" layers="25[ID:0 Layers:25(0..24)]" Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:alloc LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.791+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32 Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.317+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295 Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.317+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1325 splits=2 Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:342 msg="total memory" size="14.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=1158266880A required.CPU.Graph=5898240A required.ROCm0.ID=0 required.ROCm0.Weights="[477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 1158278400A]" required.ROCm0.Cache="[34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 0U]" required.ROCm0.Graph=165415040A Sep 21 
10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=DEBUG source=server.go:894 msg="available gpu" id=0 "available layer vram"="95.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="157.8 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=DEBUG source=server.go:728 msg="new layout created" layers="25[ID:0 Layers:25(0..24)]" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:commit LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:487 msg="offloading 24 repeating layers to GPU" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:493 msg="offloading output layer to GPU" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:342 msg="total memory" size="14.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=sched.go:473 msg="loaded runners" count=1 Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:498 msg="offloaded 25/25 layers to GPU" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.738+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.09" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.989+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.18" Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.239+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.24" Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.495+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.33" Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.746+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.35" Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.997+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.45" Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.248+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.54" Sep 21 10:06:07 GZ302EA 
ollama[57854]: time=2025-09-21T10:06:07.498+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.64" Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.749+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.73" Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.000+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.81" Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.251+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.88" Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.502+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.91" Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.752+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.95" Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.003+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.99" Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.104+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295 Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=INFO source=server.go:1289 msg="llama runner started in 5.47 seconds" Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=DEBUG source=sched.go:485 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=DEBUG source=server.go:1388 msg="completion request" images=0 prompt=307 format="" Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.280+08:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=68 used=0 remaining=68 Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 | 6.648798587s | 127.0.0.1 | POST "/api/generate" Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:493 msg="context for request finished" Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:286 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 duration=5m0s Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 refCount=0 Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 | 30.615µs | 127.0.0.1 | HEAD "/" Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 | 193.169µs | 127.0.0.1 | GET "/api/ps" ```

@alexhegit commented on GitHub (Sep 21, 2025):

@rick-github

I'd like to compare GPU driver info with you. Here is mine.

```terminal
alex@GZ302EA:~$ ls /opt/amdgpu/lib/x86_64-linux-gnu/
libdrm_amdgpu.so          libdrm_amdgpu.so.1.124.0  libdrm_radeon.so.1        libdrm.so                 libdrm.so.2.124.0
libdrm_amdgpu.so.1        libdrm_radeon.so          libdrm_radeon.so.1.124.0  libdrm.so.2               pkgconfig/
alex@GZ302EA:~$ ls /opt/amdgpu/lib/x86_64-linux-gnu/
libdrm_amdgpu.so    libdrm_amdgpu.so.1.124.0  libdrm_radeon.so.1        libdrm.so    libdrm.so.2.124.0
libdrm_amdgpu.so.1  libdrm_radeon.so          libdrm_radeon.so.1.124.0  libdrm.so.2  pkgconfig
alex@GZ302EA:~$ modinfo amdgpu | grep version
srcversion:     639640A50DD1D71D4F3C5D9
vermagic:       6.14.0-24-generic SMP preempt mod_unload modversions
parm:           hws_gws_support:Assume MEC2 FW supports GWS barriers (false = rely on FW version check (Default), true = force supported) (bool)
```

@alexhegit commented on GitHub (Sep 21, 2025):

I got some information about GTT and VRAM from https://github.com/ollama/ollama/issues/5471.

Image

The kernel changed its memory-allocation behavior for GTT and VRAM. Ollama should consider using GTT and VRAM together when loading a model: fill VRAM first, then fall back to GTT if the VRAM is not large enough. That would suit the iGPUs of AMD processors with UMA.
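For what it's worth, both pools can be observed directly through the amdgpu driver's sysfs counters, so the VRAM/GTT split is easy to watch while a model loads. A minimal sketch (the card1 path is an assumption; check /sys/class/drm/ for your device):

```bash
# Illustrative only: print VRAM and GTT totals/usage as reported by amdgpu.
# The card index (card1 here) varies per machine; adjust as needed.
for f in mem_info_vram_total mem_info_vram_used mem_info_gtt_total mem_info_gtt_used; do
  printf '%-22s %s bytes\n' "$f:" "$(cat /sys/class/drm/card1/device/$f)"
done
```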

@rick-github Are you using a Linux kernel version < 6.10.0?


@rick-github commented on GitHub (Sep 21, 2025):

```console
$ ls /opt/amdgpu/lib/x86_64-linux-gnu/
dri                         libgbm.so.1.0.0
gbm                         libGLX_mesa.so.0
libdrm_amdgpu.so            libGLX_mesa.so.0.0.0
libdrm_amdgpu.so.1          libLLVM.so.19.1
libdrm_amdgpu.so.1.124.0    libLTO.so.19.1
libdrm_radeon.so            libRemarks.so.19.1
libdrm_radeon.so.1          libwayland-client.so.0
libdrm_radeon.so.1.124.0    libwayland-client.so.0.23.0
libdrm.so                   libwayland-server.so.0
libdrm.so.2                 libwayland-server.so.0.23.0
libdrm.so.2.124.0           libxatracker.so.2
libEGL_mesa.so.0            libxatracker.so.2.5.0
libEGL_mesa.so.0.0.0        llvm-19.1
libgallium-25.0.0-devel.so  pkgconfig
libgbm.so.1                 vdpau
$ modinfo amdgpu | grep version
version:        6.12.12
srcversion:     9AB0277171A464F184AFEF4
vermagic:       6.11.0-29-generic SMP preempt mod_unload modversions
parm:           hws_gws_support:Assume MEC2 FW supports GWS barriers (false = rely on FW version check (Default), true = force supported) (bool)
```

> @rick-github Are you using a Linux kernel version < 6.10.0?

https://github.com/ollama/ollama/issues/12342#issuecomment-3311721448


@rick-github commented on GitHub (Sep 21, 2025):

Have you set `amdgpu.no_system_mem_limit` in the boot params?


@alexhegit commented on GitHub (Sep 22, 2025):

@rick-github

I never modified this boot param.

```
alex@GZ302EA:~$ cat /sys/module/amdgpu/parameters/no_system_mem_limit
N
```

@rick-github commented on GitHub (Sep 22, 2025):

Then maybe modify this boot param?
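For anyone following along, here is a minimal sketch of how that boot param is typically set on an Ubuntu system (the GRUB edit and the reboot step are my assumptions, not something from this thread):

```bash
# Illustrative only: append amdgpu.no_system_mem_limit=1 to the kernel cmdline.
sudo sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT=".*\)"/\1 amdgpu.no_system_mem_limit=1"/' /etc/default/grub
sudo update-grub && sudo reboot
# After the reboot, verify the module picked it up (expect: Y):
cat /sys/module/amdgpu/parameters/no_system_mem_limit
```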


@alexhegit commented on GitHub (Sep 24, 2025):

> Then maybe modify this boot param?

Ollama still uses GTT ahead of VRAM in my tests, even with /sys/module/amdgpu/parameters/no_system_mem_limit = Y.

What is your setting for this sysfs arg?


@rick-github commented on GitHub (Sep 24, 2025):

```
$ cat /sys/module/amdgpu/parameters/no_system_mem_limit
Y
```

@MrUhu commented on GitHub (Sep 30, 2025):

> Pushing them to GTT seems correct for me (and when I force more layers to be pushed there, it works until my GTT is full). If I minimize my VRAM, Ollama does not even allow me to use the GTT.
>
> Suspected solution (from a n00b): the calculation of the available GPU-usable memory should take VRAM or GTT into account, not only VRAM. A possible implementation might be an environment parameter to switch or to override (e.g. OLLAMA_USE_GTT=true or OLLAMA_VRAM_OVERRIDE=60G).

@Ricky1975 I also noticed that the behaviour of Ollama changed somehow.
I used to have the problem that Ollama used VRAM for the estimation and then loaded the model into GTT. Now it runs in VRAM, but I can't force it to load the model into GTT anymore.

It's a pity... the last couple of days I tinkered around a bit with some scripts to make the setup process easier, but unfortunately I can't force Ollama to use the GTT anymore. That's quite a bummer because I want to be able to use more than the 16 GB I can set in my BIOS.


@alexhegit commented on GitHub (Oct 10, 2025):

> It's a pity... the last couple of days I tinkered around a bit with some scripts to make the setup process easier, but unfortunately I can't force Ollama to use the GTT anymore. That's quite a bummer because I want to be able to use more than the 16 GB I can set in my BIOS.

@MrUhu Hi, so have you forgotten which change made it run in VRAM (instead of GTT)?


@MrUhu commented on GitHub (Oct 13, 2025):

> > It's a pity... the last couple of days I tinkered around a bit with some scripts to make the setup process easier, but unfortunately I can't force Ollama to use the GTT anymore. That's quite a bummer because I want to be able to use more than the 16 GB I can set in my BIOS.
>
> @MrUhu Hi, so have you forgotten which change made it run in VRAM (instead of GTT)?

No, I didn't forget. The vRAM handling changed in one of the earlier versions. I don't know if it was a change in ROCm or in Ollama, but now Ollama checks the VRAM and writes the model to vRAM, not GTT.
When GTT was still used, I created a custom Modelfile telling ollama to write all layers to video memory; in my case the video memory was the GTT. But now it uses the VRAM instead, so no more GTT tinkering.

I could try an older version of Ollama, but I'd have to take a look at the .sh file beforehand.


@Djip007 commented on GitHub (Oct 21, 2025):

Sorry, I didn't see this issue.
There are a lot of assumptions in the comments that aren't correct; I'll try to be clear.
Ollama doesn't manage any allocations; it only configures llama.cpp.
In the rocm/hip backend of llama.cpp, no allocation is ever requested on the GTT. I don't even think this is possible with HIP. With HIP you can allocate memory on the device or on the host; in the case of host allocation, it's possible to configure cache coherence for better performance.
On the llama.cpp side, there's an option to choose whether to allocate memory on the host or on the device. Enabling this on a dGPU significantly reduces performance, but on an iGPU it allows you to use all of the RAM with very little loss. (And I'd say it can work on Windows.)
What changed is the AMD driver in the Linux kernel. Previously, device allocations always went to vRAM; since many programs don't handle allocation on the host, this severely limited the usable size. In recent kernels (>6.11???), AMD changed the driver so that a device allocation on an APU can land either in vRAM or in GTT. This made it easy to access more memory on laptops where the manufacturer doesn't allow changing the vRAM size in the BIOS.

And there were no changes to llama.cpp/ollama to handle this.

The problem lies with ollama. It doesn't physically allocate memory, but rather tries to estimate the memory size usable by the GPUs to determine how many LLM layers it can fit there. And this is where it gets complicated:

  • On a dGPU, you have to look at the vRAM size; the rest doesn't really matter.
  • On an iGPU, it will depend on how llama.cpp is configured. If it sets the env variable (for recent versions of llama.cpp) `GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON`, then the memory size that llama.cpp can allocate is the RAM size; otherwise, it's vRAM + GTT. (See the sketch below.)
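As a concrete illustration of that switch (the model path is a placeholder, not something from this thread), the variable can be set for a one-off llama.cpp run:

```bash
# Illustrative only: ask the ggml CUDA/HIP backend to use unified memory.
# ./model.gguf is a placeholder; -ngl 99 simply means "offload all layers".
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-cli -m ./model.gguf -ngl 99 -p "why is the sky blue"
```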

Note: I know there's already been some discussion on this topic, but I don't know where it stands.
On the llama.cpp side, they've recently (?) changed the way they handle this allocation: previously, it had to be done at compile time (which was a problem for ollama), but now it's doable via an env variable.


@Djip007 commented on GitHub (Oct 21, 2025):

To finally address the iGPU case (potentially also under Windows), we need to review how the available memory is calculated and how llama.cpp is configured. For a simple case, I would say:

  • dGPU: memory == vRAM
  • iGPU: set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON` and memory == RAM

(And because the vRAM isn't needed at all on an iGPU in that case, it may be best to make it as small as possible...)


@rick-github commented on GitHub (Oct 21, 2025):

ollama is no longer a wrapper for llama.cpp, and it doesn't configure it. It uses the same ggml.org library that llama.cpp does.

`GGML_CUDA_ENABLE_UNIFIED_MEMORY` is not unified memory in the sense that an iGPU and the CPU share it. It enables a dGPU to access system RAM via the PCI bus, and it is not included in the memory estimation done by ollama, except if the layers are forced onto the GPU with `num_gpu`.


@Djip007 commented on GitHub (Oct 22, 2025):

`GGML_CUDA_ENABLE_UNIFIED_MEMORY` is part of ggml, so we can still use it.
It is, as you point out, a way to use RAM on the GPU; the fact that an APU uses unified memory means that, with a good config, it can be used with little or no loss of performance.
For now the CUDA/HIP backend has an inconsistency: when this variable is set, it continues to report the size of the vRAM (+GTT on APUs with a recent kernel) and not the size of the RAM from which it actually allocates.

I have a question: where is the memory computed? Do you use what ggml reports?


@MrUhu commented on GitHub (Nov 7, 2025):

> `GGML_CUDA_ENABLE_UNIFIED_MEMORY` is part of ggml, so we can still use it. It is, as you point out, a way to use RAM on the GPU; the fact that an APU uses unified memory means that, with a good config, it can be used with little or no loss of performance. For now the CUDA/HIP backend has an inconsistency: when this variable is set, it continues to report the size of the vRAM (+GTT on APUs with a recent kernel) and not the size of the RAM from which it actually allocates.
>
> I have a question: where is the memory computed? Do you use what ggml reports?

Thanks for this info. Works for me.

If anyone is also using Fedora, here are my Ollama scripts for Fedora:
https://github.com/MrUhu/handy-fedora-scripts-for-ollama

The update.sh script updates your PC, checks the Ollama release page for the latest release, and only runs the update command when there is a new one. Then it adds a couple of environment variables (primarily GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON and HSA_OVERRIDE_GFX_VERSION=11.0.2) to the ollama.service file and restarts the service. Change the HSA override to your preferred version.
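For context, adding service-level environment variables like that is usually done with a systemd drop-in; a minimal sketch (the values are the ones from the comment above, not recommendations):

```bash
# Illustrative only: add the env vars to ollama.service via a drop-in.
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
#   Environment="HSA_OVERRIDE_GFX_VERSION=11.0.2"
sudo systemctl restart ollama
```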

The change_gtt_size_for_amd_igpu.sh script takes your desired GTT size in GB (!!!), checks your available system memory, and writes the chosen GTT size to grubby. I've added a limiter of 50% of your available system memory; you can edit this out if you want.
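The grubby step would look roughly like this; amdgpu.gttsize is given in MiB, and the 28 GiB value just mirrors the example further down (a sketch, not the script itself):

```bash
# Illustrative only: pin the GTT size to 28 GiB (28672 MiB) via grubby (Fedora).
sudo grubby --update-kernel=ALL --args="amdgpu.gttsize=28672"
sudo reboot
# Verify after the reboot:
cat /sys/module/amdgpu/parameters/gttsize
```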

And finally, overwrite_gpu_restriction_to_modelfiles.sh takes your list of models from ollama ls, looks up their layer count on ollama.com, writes this layer count to a new Modelfile with `PARAMETER num_gpu <layer-count>`, and creates these models for ollama to use.
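The Modelfile trick it automates is plain Ollama syntax; a hand-rolled minimal version looks like this (model name and layer count are placeholders):

```bash
# Illustrative only: clone a model with all of its layers pinned to the GPU.
cat > Modelfile.myqwen3-coder <<'EOF'
FROM qwen3-coder
PARAMETER num_gpu 49
EOF
ollama create myqwen3-coder -f Modelfile.myqwen3-coder
```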

I currently run qwen3-coder (a q4 model, 19 GB total) as "myqwen3-code" on my 7940HS with an RX 780M, with 4 GB of VRAM allocated and all layers running in 28 GB of GTT (the edited models are just called my + model name).

Use at your own risk, of course.


@namecaps3k commented on GitHub (Dec 5, 2025):

I have a similar problem, but my VRAM is set to 1 GB (the lowest value I can set in the BIOS) because I want to use only GTT (the full 120 GB or so). I have ollama installed the official way with the newest ROCm; ollama finds it, but it also sees the low VRAM and loads everything to GTT, or only partially if the model is small.

A user here has exactly the same issue: https://github.com/ollama/ollama/issues/12062

The thing is that all inference is done on the CPU, which is painfully slow. Any idea what I can do to run it on the GPU? llama.cpp runs perfectly fine with this setup. When I set the VRAM in the BIOS to something larger, it shows 100% GPU instead of CPU and works way faster (50 tokens/s vs 20).
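One thing that may be worth trying (my assumption, not something confirmed in this thread): num_gpu can also be forced per request through the API, which sidesteps the VRAM-based layer estimate:

```bash
# Illustrative only: force all layers onto the GPU for a single request.
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_gpu": 99 }
}'
```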

```terminal
ai@minis:~$ ollama -v
ollama version is 0.13.1

ai@minis:~$ ollama run gpt-oss:20b hello; ollama ps

Thinking...
User says "hello". They want greeting? We should respond politely, maybe ask how can help.
...done thinking.

Hello! 👋 How can I help you today?

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    17052f91a42e    14 GB    100% CPU     4096       4 minutes from now
```

Relevant log lines:

```
Nov 30 12:20:35 minis ollama[193626]: time=2025-11-30T12:20:35.977Z level=INFO source=types.go:42 msg="inference compute" id=0 filter_id=0 library=ROCm compute=gfx1151 name=ROCm0 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=60342.13 pci_id=00>
Nov 30 12:20:35 minis ollama[193626]: time=2025-11-30T12:20:35.977Z level=INFO source=routes.go:1638 msg="entering low vram mode" "total vram"="1.0 GiB" threshold="20.0 GiB"
Nov 30 12:21:11 minis ollama[193626]
Nov 30 12:24:32 minis ollama[193626]: Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, ID: 0
```

Reference: github-starred/ollama#33958