[GH-ISSUE #12342] Bug: inconsistent use of VRAM and GTT on the iGPU of AMD Ryzen AI processors #33958

Closed
opened 2026-04-22 17:08:35 -05:00 by GiteaMirror · 30 comments

Originally created by @alexhegit on GitHub (Sep 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12342

What is the issue?

Ollama does not use the VRAM and GTT of the AMD Ryzen AI processor's iGPU with consistent logic.

The iGPU of an AMD Ryzen AI processor has two memory pools: the VRAM size is set in the BIOS, and half of the remaining system memory becomes GTT.

e.g. for the 128GB DDR memory of an AMD Ryzen AI Max+ laptop - ROG Flow Z13 (2025) GZ302, the split works out as follows (sketched in code below):

  1. if VRAM is set to 96GB, then GTT = (128-96)/2 = 16GB
  2. if VRAM is set to 8GB, then GTT = (128-8)/2 = 60GB
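For reference, a minimal sketch of this split in Go (just the arithmetic above, not code from Ollama):

```go
package main

import "fmt"

// gttSize returns the GTT size implied by the BIOS VRAM carve-out,
// per the rule above: half of the remaining system memory becomes GTT.
func gttSize(totalGB, vramGB int) int {
	return (totalGB - vramGB) / 2
}

func main() {
	fmt.Println(gttSize(128, 96)) // 16
	fmt.Println(gttSize(128, 8))  // 60
}
```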

Test platform:

Hardware: AMD Ryzen AI Max+ 395 laptop - ROG Flow Z13 (2025) GZ302, with 96GB VRAM set in the BIOS

OS: Ubuntu 24.04
Ollama: 0.11.11

Test 1:

VRAM=96GB, GTT=16GB

```bash
ollama run gpt-oss:20b --verbose "why is sky blue"
```

Use `radeontop` to monitor the VRAM usage.

The model gpt-oss:20b is loaded into GTT (16GB) rather than VRAM (96GB).

![Image](https://github.com/user-attachments/assets/98606d1a-db8c-4854-9ae4-fa8f2e761318)

The model is expected to load into VRAM.

Other tests:

Running qwen3:32b fails:

```terminal
$ ollama run qwen3:32b --verbose "why is sky blue"
Error: 500 Internal Server Error: llama runner process has terminated: error loading model: unable to allocate ROCm0 buffer
llama_model_load_from_file_impl: failed to load model
```

The model is expected to load into VRAM.

Relevant log output


OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.11.11

GiteaMirror added the needs more info, bug labels 2026-04-22 17:08:35 -05:00

@alexhegit commented on GitHub (Sep 19, 2025):

Test 2:
VRAM=8GB, GTT=60GB

Then I reset the VRAM to 8GB (so GTT is 60GB) and ran gpt-oss:20b; the GTT is large enough to load the full model.

But `ollama ps` shows it running in GPU/CPU hybrid mode. It seems that Ollama uses the VRAM size to estimate the memory footprint when deciding whether to run the model on the CPU, the GPU, or both, but loads the model into GTT at runtime. That means Ollama has a bug: it does not use consistent logic between the memory-footprint estimate and the real runtime memory usage.

![Image](https://github.com/user-attachments/assets/cadc949c-29e6-4810-a8e3-4140bdb61043)

@alexhegit commented on GitHub (Sep 19, 2025):

Expected logic:

If GTT is chosen for loading the model, then GTT should also be used to estimate the memory footprint when deciding whether to run in CPU, GPU, or CPU+GPU mode (see the sketch after this list).

Or, more simply, use VRAM for loading the model, since:

  1. The BIOS has an option to set the VRAM for the iGPU; it can be 96GB on an AMD Ryzen AI Max+ 395 laptop - ROG Flow Z13 (2025) GZ302. That means it could run gpt-oss-120b in GPU mode.
  2. Using VRAM makes the iGPU and dGPU follow the same logic in Ollama.
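A minimal sketch of the consistency argument above (hypothetical names, not Ollama's actual code): whichever pool the runner will actually allocate from should also be the pool used for the footprint estimate.

```go
package main

import "fmt"

// gpuMemory models the two pools of an AMD APU; illustrative only.
type gpuMemory struct {
	vramFree uint64 // BIOS carve-out pool
	gttFree  uint64 // kernel-managed GTT pool
}

// budget returns the pool the runtime will allocate from, so the
// CPU/GPU/hybrid decision is judged against the same number as the
// real load.
func budget(m gpuMemory, loadsIntoGTT bool) uint64 {
	if loadsIntoGTT {
		return m.gttFree
	}
	return m.vramFree
}

func main() {
	m := gpuMemory{vramFree: 8 << 30, gttFree: 60 << 30} // Test 2 setup
	fmt.Printf("budget: %d GiB\n", budget(m, true)>>30)  // 60 GiB
}
```

Under Test 2 (VRAM=8GB, GTT=60GB), estimating against GTT would correctly conclude that gpt-oss:20b fits entirely on the GPU instead of falling back to hybrid mode.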

@rick-github commented on GitHub (Sep 19, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may help in debugging.

I have an evo-x2 (AMD Ryzen AI Max+ 395) with kernel 6.11.0-29-generic; with 96GB VRAM set in the BIOS, ollama loads models into VRAM.


@Ricky1975 commented on GitHub (Sep 20, 2025):

I have the same issue with an M890 Pro Mini PC (AMD Ryzen 9 8945HS w/ Radeon 780M Graphics)
Kernel 6.1.0-39-amd64, ROCk module version 6.12.12

I think the issue is:
When Ollama starts up, it uses the amount of VRAM to calculate the layers to push to the GPU. When loading, it pushes the layers to the GTT.
Pushing them to GTT seems correct to me (and when I force more layers to be pushed there, it works until my GTT is full).
If I minimize my VRAM, Ollama does not even allow me to use the GTT.

Suspected solution (from a n00b):
The calculation of the available GPU-usable memory should take VRAM or GTT into account, not only VRAM. A possible implementation might be an environment parameter to switch or to override (e.g. OLLAMA_USE_GTT=true or OLLAMA_VRAM_OVERRIDE=60G); a parsing sketch follows.
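For illustration, a minimal sketch of parsing the proposed OLLAMA_VRAM_OVERRIDE value; the variable is the suggestion above and parseSize is a hypothetical helper, not an existing Ollama feature:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseSize converts a value like "60G" into bytes. Hypothetical helper
// for the proposed OLLAMA_VRAM_OVERRIDE variable; not part of Ollama.
func parseSize(s string) (uint64, error) {
	s = strings.ToUpper(strings.TrimSpace(s))
	mult := uint64(1)
	if strings.HasSuffix(s, "G") {
		mult = 1 << 30
		s = strings.TrimSuffix(s, "G")
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * mult, nil
}

func main() {
	if v := os.Getenv("OLLAMA_VRAM_OVERRIDE"); v != "" {
		if b, err := parseSize(v); err == nil {
			fmt.Printf("overriding detected GPU memory to %d bytes\n", b)
		}
	}
}
```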


@rick-github commented on GitHub (Sep 20, 2025):

As mentioned, it works fine for me. Useful information to debug this might be found in the server logs.


@alexhegit commented on GitHub (Sep 20, 2025):

> I have the same issue with an M890 Pro Mini PC (AMD Ryzen 9 8945HS w/ Radeon 780M Graphics) Kernel 6.1.0-39-amd64, ROCk module version 6.12.12
>
> I think the issue is: When Ollama starts up, it uses the amount of VRAM to calculate the layers to push to the GPU. When loading, it pushes the layers to the GTT. Pushing them to GTT seems correct to me (and when I force more layers to be pushed there, it works until my GTT is full). If I minimize my VRAM, Ollama does not even allow me to use the GTT.
>
> Suspected solution (from a n00b): The calculation of the available GPU-usable memory should take VRAM or GTT into account, not only VRAM. A possible implementation might be an environment parameter to switch or to override (e.g. OLLAMA_USE_GTT=true or OLLAMA_VRAM_OVERRIDE=60G)

Yes, we have the same issue and the same expectation.


@rick-github commented on GitHub (Sep 20, 2025):

As mentioned, it works fine for me. Useful information to debug this might be found in the server logs.


@alexhegit commented on GitHub (Sep 20, 2025):

> [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may help in debugging.
>
> I have an evo-x2 (AMD Ryzen AI Max+ 395) with kernel 6.11.0-29-generic; with 96GB VRAM set in the BIOS, ollama loads models into VRAM.

Did you use the same Ollama version 0.11.11 to run gpt-oss:20b? Please use `ollama ps` to see whether it runs at 100% CPU rather than 100% GPU.

The current logic of Ollama uses VRAM to judge whether the GPU memory is enough to load the model, but loads and runs it in GTT.


@alexhegit commented on GitHub (Sep 20, 2025):

There is a fork repo trying to solve this issue for AMD APUs (with iGPU): https://github.com/rjmalagon/ollama-linux-amd-apu

It adds a new path that uses GTT for AMD APUs in https://github.com/rjmalagon/ollama-linux-amd-apu/blob/main/discover/amd_linux.go

![Image](https://github.com/user-attachments/assets/90c1f992-4078-45fa-94ab-7290cfca0cea)
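For context, the amdgpu kernel driver exposes the GTT pool through sysfs next to the VRAM counters; below is a minimal sketch of reading it in the spirit of that fork (assuming the standard mem_info_gtt_total / mem_info_gtt_used nodes; this is not the fork's actual code):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readSysfsUint reads a single integer from an amdgpu sysfs file,
// e.g. /sys/class/drm/card1/device/mem_info_gtt_total.
func readSysfsUint(path string) (uint64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	dev := "/sys/class/drm/card1/device" // card index varies per system
	total, err := readSysfsUint(dev + "/mem_info_gtt_total")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	used, _ := readSysfsUint(dev + "/mem_info_gtt_used")
	fmt.Printf("GTT total=%d used=%d free=%d\n", total, used, total-used)
}
```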

@rick-github commented on GitHub (Sep 20, 2025):

```console
$ ollama -v ; ollama run gpt-oss:20b hello ; ollama ps
ollama version is 0.11.11
Thinking...
We have a conversation. The user says "hello". We need to respond. The instructions: "You are ChatGPT, a large language model trained by OpenAI." No special instruction. We should greet, ask how can help.
Let's respond politely.
...done thinking.

Hello! 👋 How can I help you today?

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    aa4295ac10c3    14 GB    100% GPU     8192       Forever
```

Useful information to debug this might be found in the server logs.


@alexhegit commented on GitHub (Sep 21, 2025):

> ollama -v ; ollama run gpt-oss:20b hello ; ollama ps

Did you use radeontop to monitor where the model is loaded, GTT or VRAM?

My test shows it uses VRAM to estimate the model's memory footprint but uses GTT to load and run the model.

1. Memory setting: VRAM=16GB, GTT=56GB

2. OS:

```shell
alex@GZ302EA:~$ uname -a
Linux GZ302EA 6.14.0-24-generic #24~24.04.3-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul  7 16:39:17 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
```

3. Test command:

```shell
ollama -v ; ollama run gpt-oss:20b hello ; ollama ps
```

![Image](https://github.com/user-attachments/assets/753b8c0f-91c9-4e17-940e-0344086f9211)

@rick-github commented on GitHub (Sep 21, 2025):

```
             12979M / 98140M VRAM  13.22% │
                 14M / 15848M GTT   0.09% │
       1.00G / 1.00G Memory Clock 100.00% │
       0.60G / 2.90G Shader Clock  20.69% │
```

Useful information to debug this might be found in the server logs.


@alexhegit commented on GitHub (Sep 21, 2025):

> ```
>              12979M / 98140M VRAM  13.22% │
>                  14M / 15848M GTT   0.09% │
>        1.00G / 1.00G Memory Clock 100.00% │
>        0.60G / 2.90G Shader Clock  20.69% │
> ```
>
> Useful information to debug this might be found in the server logs.

Interesting. I cannot explain the different results between us. We are using the same Ollama and the same model in the same test cases.

I re-set the VRAM to 96GB in the BIOS and tested again. It still loads the model into GTT, as monitored by radeontop.

The log shows VRAM=96GB and gpt-oss:20b running on the AMD iGPU gfx1151.

```terminal
Sep 21 10:05:47 GZ302EA systemd[1]: Stopping ollama.service - Ollama Service...
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=sched.go:265 msg="shutting down scheduler completed loop"
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=sched.go:766 msg="shutting down runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=server.go:1683 msg="stopping llama server" pid=55289
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=sched.go:135 msg="shutting down scheduler pending loop"
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.784+08:00 level=DEBUG source=server.go:1689 msg="waiting for llama server to exit" pid=55289
Sep 21 10:05:47 GZ302EA ollama[39440]: time=2025-09-21T10:05:47.869+08:00 level=DEBUG source=server.go:1693 msg="llama server stopped" pid=55289
Sep 21 10:05:47 GZ302EA systemd[1]: ollama.service: Deactivated successfully.
Sep 21 10:05:47 GZ302EA systemd[1]: Stopped ollama.service - Ollama Service.
Sep 21 10:05:47 GZ302EA systemd[1]: ollama.service: Consumed 1min 24.998s CPU time, 5.3G memory peak, 210.7M memory swap peak.
Sep 21 10:05:56 GZ302EA systemd[1]: Started ollama.service - Ollama Service.
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.187+08:00 level=INFO source=routes.go:1332 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[chrome-extension://* moz-extension://* safari-web-extension://* ollama serve http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.189+08:00 level=INFO source=images.go:477 msg="total blobs: 88"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.190+08:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.190+08:00 level=INFO source=routes.go:1385 msg="Listening on [::]:11434 (version 0.11.11)"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.190+08:00 level=DEBUG source=sched.go:121 msg="starting llm scheduler"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.190+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.191+08:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.191+08:00 level=DEBUG source=gpu.go:512 msg="Searching for GPU library" name=libcuda.so*
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.191+08:00 level=DEBUG source=gpu.go:536 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.199+08:00 level=DEBUG source=gpu.go:569 msg="discovered GPU libraries" paths=[]
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.199+08:00 level=DEBUG source=gpu.go:512 msg="Searching for GPU library" name=libcudart.so*
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.199+08:00 level=DEBUG source=gpu.go:536 msg="gpu library search" globs="[/usr/local/lib/ollama/libcudart.so* /libcudart.so* /usr/local/lib/ollama/cuda_v*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.200+08:00 level=DEBUG source=gpu.go:569 msg="discovered GPU libraries" paths="[/usr/local/lib/ollama/cuda_v12/libcudart.so.12.8.90 /usr/local/lib/ollama/cuda_v13/libcudart.so.13.0.88]"
Sep 21 10:05:56 GZ302EA ollama[57854]: cudaSetDevice err: 35
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.201+08:00 level=DEBUG source=gpu.go:585 msg="Unable to load cudart library /usr/local/lib/ollama/cuda_v12/libcudart.so.12.8.90: your nvidia driver is too old or missing.  If you have a CUDA GPU please upgrade to run ollama"
Sep 21 10:05:56 GZ302EA ollama[57854]: cudaSetDevice err: 35
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.201+08:00 level=DEBUG source=gpu.go:585 msg="Unable to load cudart library /usr/local/lib/ollama/cuda_v13/libcudart.so.13.0.88: your nvidia driver is too old or missing.  If you have a CUDA GPU please upgrade to run ollama"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.201+08:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/download/linux-drivers.html" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/0/properties"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:122 msg="detected CPU /sys/class/kfd/kfd/topology/nodes/0/properties"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/1/properties"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:203 msg="mapping amdgpu to drm sysfs nodes" amdgpu=/sys/class/kfd/kfd/topology/nodes/1/properties vendor=4098 device=5510 unique_id=0
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:237 msg=matched amdgpu=/sys/class/kfd/kfd/topology/nodes/1/properties drm=/sys/class/drm/card1/device
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:343 msg="amdgpu memory" gpu=0 total="96.0 GiB"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_linux.go:344 msg="amdgpu memory" gpu=0 available="95.7 GiB"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /usr/local/lib/ollama/rocm"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.202+08:00 level=DEBUG source=amd_common.go:44 msg="detected ROCM next to ollama executable /usr/local/lib/ollama/rocm"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.205+08:00 level=DEBUG source=amd_linux.go:375 msg="rocm supported GPUs" types="[gfx1010 gfx1012 gfx1030 gfx1100 gfx1101 gfx1102 gfx1151 gfx1200 gfx1201 gfx900 gfx906 gfx908 gfx90a gfx942]"
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.205+08:00 level=INFO source=amd_linux.go:390 msg="amdgpu is supported" gpu=0 gpu_type=gfx1151
Sep 21 10:05:56 GZ302EA ollama[57854]: time=2025-09-21T10:05:56.205+08:00 level=INFO source=types.go:131 msg="inference compute" id=0 library=rocm variant="" compute=gfx1151 driver=0.0 name=1002:1586 total="96.0 GiB" available="95.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:03 | 200 |      67.331µs |       127.0.0.1 | GET      "/api/version"
Sep 21 10:06:03 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:03 | 200 |      23.062µs |       127.0.0.1 | HEAD     "/"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.617+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:03 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:03 | 200 |   56.855199ms |       127.0.0.1 | POST     "/api/show"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.697+08:00 level=DEBUG source=gpu.go:402 msg="updating system memory data" before.total="31.0 GiB" before.free="29.3 GiB" before.free_swap="7.7 GiB" now.total="31.0 GiB" now.free="29.5 GiB" now.free_swap="7.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.697+08:00 level=DEBUG source=amd_linux.go:492 msg="updating rocm free memory" gpu=0 name=1002:1586 before="95.7 GiB" now="95.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.697+08:00 level=DEBUG source=sched.go:188 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.714+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.715+08:00 level=DEBUG source=sched.go:208 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.784+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.784+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=gpu.go:402 msg="updating system memory data" before.total="31.0 GiB" before.free="29.5 GiB" before.free_swap="7.7 GiB" now.total="31.0 GiB" now.free="29.5 GiB" now.free_swap="7.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=amd_linux.go:492 msg="updating rocm free memory" gpu=0 name=1002:1586 before="95.7 GiB" now="95.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:200 msg="model wants flash attention"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:217 msg="enabling flash attention"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=server.go:324 msg="adding gpu library" path=/usr/local/lib/ollama/rocm
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=server.go:332 msg="adding gpu dependency paths" paths=[/usr/local/lib/ollama/rocm]
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 46445"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=server.go:400 msg=subprocess PATH=/home/alex/.local/bin:/home/alex/Android/Sdk:/home/alex/development/flutter/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin OLLAMA_HOST=0.0.0.0:11434 OLLAMA_ORIGINS="chrome-extension://*,moz-extension://*,safari-web-extension://* ollama serve" OLLAMA_FLASH_ATTENTION=1 OLLAMA_MAX_QUEUE=512 OLLAMA_NUM_PARALLEL=4 OLLAMA_DEBUG=1 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/rocm LD_LIBRARY_PATH=/usr/local/lib/ollama/rocm:/usr/local/lib/ollama/rocm:/usr/local/lib/ollama:/usr/local/lib/ollama ROCR_VISIBLE_DEVICES=0
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:672 msg="loading model" "model layers"=25 requested=-1
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=gpu.go:402 msg="updating system memory data" before.total="31.0 GiB" before.free="29.5 GiB" before.free_swap="7.7 GiB" now.total="31.0 GiB" now.free="29.5 GiB" now.free_swap="7.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=DEBUG source=amd_linux.go:492 msg="updating rocm free memory" gpu=0 name=1002:1586 before="95.7 GiB" now="95.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:678 msg="system memory" total="31.0 GiB" free="29.5 GiB" free_swap="7.7 GiB"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.785+08:00 level=INFO source=server.go:686 msg="gpu memory" id=0 available="95.3 GiB" free="95.7 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.792+08:00 level=INFO source=runner.go:1254 msg="starting ollama engine"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.792+08:00 level=INFO source=runner.go:1289 msg="Server listening on 127.0.0.1:46445"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.797+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:fit LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.name default=""
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.description default=""
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=INFO source=ggml.go:131 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
Sep 21 10:06:03 GZ302EA ollama[57854]: time=2025-09-21T10:06:03.830+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
Sep 21 10:06:04 GZ302EA ollama[57854]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 21 10:06:04 GZ302EA ollama[57854]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 21 10:06:04 GZ302EA ollama[57854]: ggml_cuda_init: found 1 ROCm devices:
Sep 21 10:06:04 GZ302EA ollama[57854]:   Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, ID: 0
Sep 21 10:06:04 GZ302EA ollama[57854]: load_backend: loaded ROCm backend from /usr/local/lib/ollama/libggml-hip.so
Sep 21 10:06:04 GZ302EA ollama[57854]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.609+08:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama/rocm
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.609+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.610+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.610+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.760+08:00 level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1325 splits=2
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=backend.go:342 msg="total memory" size="14.1 GiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=1158266880U required.CPU.Graph=5898240U required.ROCm0.ID=0 required.ROCm0.Weights="[477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 477628800U 1158278400U]" required.ROCm0.Cache="[34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 0U]" required.ROCm0.Graph=165415040U
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:894 msg="available gpu" id=0 "available layer vram"="95.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="157.8 MiB"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:728 msg="new layout created" layers="25[ID:0 Layers:25(0..24)]"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:alloc LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.791+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.317+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.317+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1325 splits=2
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:342 msg="total memory" size="14.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=1158266880A required.CPU.Graph=5898240A required.ROCm0.ID=0 required.ROCm0.Weights="[477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 1158278400A]" required.ROCm0.Cache="[34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 0U]" required.ROCm0.Graph=165415040A
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=DEBUG source=server.go:894 msg="available gpu" id=0 "available layer vram"="95.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="157.8 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=DEBUG source=server.go:728 msg="new layout created" layers="25[ID:0 Layers:25(0..24)]"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:commit LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:487 msg="offloading 24 repeating layers to GPU"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:493 msg="offloading output layer to GPU"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:342 msg="total memory" size="14.1 GiB"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:498 msg="offloaded 25/25 layers to GPU"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.738+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.09"
Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.989+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.18"
Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.239+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.24"
Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.495+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.33"
Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.746+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.35"
Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.997+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.45"
Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.248+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.54"
Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.498+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.64"
Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.749+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.73"
Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.000+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.81"
Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.251+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.88"
Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.502+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.91"
Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.752+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.95"
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.003+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.99"
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.104+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=INFO source=server.go:1289 msg="llama runner started in 5.47 seconds"
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=DEBUG source=sched.go:485 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=DEBUG source=server.go:1388 msg="completion request" images=0 prompt=307 format=""
Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.280+08:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=68 used=0 remaining=68
Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 |  6.648798587s |       127.0.0.1 | POST     "/api/generate"
Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:493 msg="context for request finished"
Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:286 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 duration=5m0s
Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 refCount=0
Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 |      30.615µs |       127.0.0.1 | HEAD     "/"
Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 |     193.169µs |       127.0.0.1 | GET      "/api/ps"
```

34603008U 67108864U 34603008U 67108864U 34603008U 67108864U 0U]" required.ROCm0.Graph=165415040U Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:894 msg="available gpu" id=0 "available layer vram"="95.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="157.8 MiB" Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=DEBUG source=server.go:728 msg="new layout created" layers="25[ID:0 Layers:25(0..24)]" Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.761+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:alloc LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Sep 21 10:06:04 GZ302EA ollama[57854]: time=2025-09-21T10:06:04.791+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=general.alignment default=32 Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.317+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295 Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.317+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1325 splits=2 Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=backend.go:342 msg="total memory" size="14.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.486+08:00 level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=1158266880A required.CPU.Graph=5898240A required.ROCm0.ID=0 required.ROCm0.Weights="[477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 477628800A 1158278400A]" required.ROCm0.Cache="[34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 34603008A 67108864A 0U]" required.ROCm0.Graph=165415040A Sep 21 
10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=DEBUG source=server.go:894 msg="available gpu" id=0 "available layer vram"="95.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="157.8 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=DEBUG source=server.go:728 msg="new layout created" layers="25[ID:0 Layers:25(0..24)]" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=runner.go:1173 msg=load request="{Operation:commit LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType: NumThreads:16 GPULayers:25[ID:0 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:487 msg="offloading 24 repeating layers to GPU" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:493 msg="offloading output layer to GPU" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:321 msg="kv cache" device=ROCm0 size="1.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:332 msg="compute graph" device=ROCm0 size="157.8 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=backend.go:342 msg="total memory" size="14.1 GiB" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=sched.go:473 msg="loaded runners" count=1 Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=ggml.go:498 msg="offloaded 25/25 layers to GPU" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.487+08:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.738+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.09" Sep 21 10:06:05 GZ302EA ollama[57854]: time=2025-09-21T10:06:05.989+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.18" Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.239+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.24" Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.495+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.33" Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.746+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.35" Sep 21 10:06:06 GZ302EA ollama[57854]: time=2025-09-21T10:06:06.997+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.45" Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.248+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.54" Sep 21 10:06:07 GZ302EA 
ollama[57854]: time=2025-09-21T10:06:07.498+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.64" Sep 21 10:06:07 GZ302EA ollama[57854]: time=2025-09-21T10:06:07.749+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.73" Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.000+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.81" Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.251+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.88" Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.502+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.91" Sep 21 10:06:08 GZ302EA ollama[57854]: time=2025-09-21T10:06:08.752+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.95" Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.003+08:00 level=DEBUG source=server.go:1295 msg="model load progress 0.99" Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.104+08:00 level=DEBUG source=ggml.go:274 msg="key with type not found" key=gptoss.pooling_type default=4294967295 Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=INFO source=server.go:1289 msg="llama runner started in 5.47 seconds" Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=DEBUG source=sched.go:485 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.254+08:00 level=DEBUG source=server.go:1388 msg="completion request" images=0 prompt=307 format="" Sep 21 10:06:09 GZ302EA ollama[57854]: time=2025-09-21T10:06:09.280+08:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=68 used=0 remaining=68 Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 | 6.648798587s | 127.0.0.1 | POST "/api/generate" Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:493 msg="context for request finished" Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:286 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 duration=5m0s Sep 21 10:06:10 GZ302EA ollama[57854]: time=2025-09-21T10:06:10.268+08:00 level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:20b runner.inference=rocm runner.devices=1 runner.size="14.1 GiB" runner.vram="14.1 GiB" runner.parallel=4 runner.pid=58040 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=8192 refCount=0 Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 | 30.615µs | 127.0.0.1 | HEAD "/" Sep 21 10:06:10 GZ302EA ollama[57854]: [GIN] 2025/09/21 - 10:06:10 | 200 | 193.169µs | 127.0.0.1 | GET "/api/ps" ```

@alexhegit commented on GitHub (Sep 21, 2025):

@rick-github

I'd like to compare GPU driver info with you. Here is mine.

```terminal
alex@GZ302EA:~$ ls /opt/amdgpu/lib/x86_64-linux-gnu/
libdrm_amdgpu.so          libdrm_amdgpu.so.1.124.0  libdrm_radeon.so.1        libdrm.so                 libdrm.so.2.124.0
libdrm_amdgpu.so.1        libdrm_radeon.so          libdrm_radeon.so.1.124.0  libdrm.so.2               pkgconfig/
alex@GZ302EA:~$ ls /opt/amdgpu/lib/x86_64-linux-gnu/
libdrm_amdgpu.so    libdrm_amdgpu.so.1.124.0  libdrm_radeon.so.1        libdrm.so    libdrm.so.2.124.0
libdrm_amdgpu.so.1  libdrm_radeon.so          libdrm_radeon.so.1.124.0  libdrm.so.2  pkgconfig
alex@GZ302EA:~$ modinfo amdgpu | grep version
srcversion:     639640A50DD1D71D4F3C5D9
vermagic:       6.14.0-24-generic SMP preempt mod_unload modversions
parm:           hws_gws_support:Assume MEC2 FW supports GWS barriers (false = rely on FW version check (Default), true = force supported) (bool)
```

@alexhegit commented on GitHub (Sep 21, 2025):

I got some information about GTT and VRAM from https://github.com/ollama/ollama/issues/5471.

Image

The kernel changed its memory-allocation behavior for GTT and VRAM. Ollama should consider using GTT and VRAM together when loading a model: fill VRAM first, then fall back to GTT if the VRAM is not large enough. That would suit the iGPUs of AMD processors with UMA.
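For what it's worth, both pools can be observed directly through the amdgpu driver's sysfs counters, so the VRAM/GTT split is easy to watch while a model loads. A minimal sketch (the card1 path is an assumption; check /sys/class/drm/ for your device):

```bash
# Illustrative only: print VRAM and GTT totals/usage as reported by amdgpu.
# The card index (card1 here) varies per machine; adjust as needed.
for f in mem_info_vram_total mem_info_vram_used mem_info_gtt_total mem_info_gtt_used; do
  printf '%-22s %s bytes\n' "$f:" "$(cat /sys/class/drm/card1/device/$f)"
done
```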

@rick-github Are you using a Linux kernel version < 6.10.0?


@rick-github commented on GitHub (Sep 21, 2025):

```console
$ ls /opt/amdgpu/lib/x86_64-linux-gnu/
dri                         libgbm.so.1.0.0
gbm                         libGLX_mesa.so.0
libdrm_amdgpu.so            libGLX_mesa.so.0.0.0
libdrm_amdgpu.so.1          libLLVM.so.19.1
libdrm_amdgpu.so.1.124.0    libLTO.so.19.1
libdrm_radeon.so            libRemarks.so.19.1
libdrm_radeon.so.1          libwayland-client.so.0
libdrm_radeon.so.1.124.0    libwayland-client.so.0.23.0
libdrm.so                   libwayland-server.so.0
libdrm.so.2                 libwayland-server.so.0.23.0
libdrm.so.2.124.0           libxatracker.so.2
libEGL_mesa.so.0            libxatracker.so.2.5.0
libEGL_mesa.so.0.0.0        llvm-19.1
libgallium-25.0.0-devel.so  pkgconfig
libgbm.so.1                 vdpau
$ modinfo amdgpu | grep version
version:        6.12.12
srcversion:     9AB0277171A464F184AFEF4
vermagic:       6.11.0-29-generic SMP preempt mod_unload modversions
parm:           hws_gws_support:Assume MEC2 FW supports GWS barriers (false = rely on FW version check (Default), true = force supported) (bool)
```

> @rick-github Are you using a Linux kernel version < 6.10.0?

https://github.com/ollama/ollama/issues/12342#issuecomment-3311721448


@rick-github commented on GitHub (Sep 21, 2025):

Have you set `amdgpu.no_system_mem_limit` in the boot params?


@alexhegit commented on GitHub (Sep 22, 2025):

@rick-github

I never modified this boot param.

```
alex@GZ302EA:~$ cat /sys/module/amdgpu/parameters/no_system_mem_limit
N
```

@rick-github commented on GitHub (Sep 22, 2025):

Then maybe modify this boot param?
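For anyone following along, here is a minimal sketch of how that boot param is typically set on an Ubuntu system (the GRUB edit and the reboot step are my assumptions, not something from this thread):

```bash
# Illustrative only: append amdgpu.no_system_mem_limit=1 to the kernel cmdline.
sudo sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT=".*\)"/\1 amdgpu.no_system_mem_limit=1"/' /etc/default/grub
sudo update-grub && sudo reboot
# After the reboot, verify the module picked it up (expect: Y):
cat /sys/module/amdgpu/parameters/no_system_mem_limit
```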


@alexhegit commented on GitHub (Sep 24, 2025):

> Then maybe modify this boot param?

Ollama still uses GTT ahead of VRAM in my tests, even with /sys/module/amdgpu/parameters/no_system_mem_limit = Y.

What is your setting for this sysfs arg?


@rick-github commented on GitHub (Sep 24, 2025):

```
$ cat /sys/module/amdgpu/parameters/no_system_mem_limit
Y
```

@MrUhu commented on GitHub (Sep 30, 2025):

> Pushing them to GTT seems correct for me (and when I force more layers to be pushed there, it works until my GTT is full). If I minimize my VRAM, Ollama does not even allow me to use the GTT.
>
> Suspected solution (from a n00b): the calculation of the available GPU-usable memory should take VRAM or GTT into account, not only VRAM. A possible implementation might be an environment parameter to switch or to override (e.g. OLLAMA_USE_GTT=true or OLLAMA_VRAM_OVERRIDE=60G).

@Ricky1975 I also noticed that the behaviour of Ollama changed somehow.
I used to have the problem that Ollama used VRAM for the estimation and then loaded the model into GTT. Now it runs in VRAM, but I can't force it to load the model into GTT anymore.

It's a pity... the last couple of days I tinkered around a bit with some scripts to make the setup process easier, but unfortunately I can't force Ollama to use the GTT anymore. That's quite a bummer because I want to be able to use more than the 16 GB I can set in my BIOS.


@alexhegit commented on GitHub (Oct 10, 2025):

> It's a pity... the last couple of days I tinkered around a bit with some scripts to make the setup process easier, but unfortunately I can't force Ollama to use the GTT anymore. That's quite a bummer because I want to be able to use more than the 16 GB I can set in my BIOS.

@MrUhu Hi, so have you forgotten which change made it run in VRAM (instead of GTT)?


@MrUhu commented on GitHub (Oct 13, 2025):

> > It's a pity... the last couple of days I tinkered around a bit with some scripts to make the setup process easier, but unfortunately I can't force Ollama to use the GTT anymore. That's quite a bummer because I want to be able to use more than the 16 GB I can set in my BIOS.
>
> @MrUhu Hi, so have you forgotten which change made it run in VRAM (instead of GTT)?

No, I didn't forget. The vRAM handling changed in one of the earlier versions. I don't know if it was a change in ROCm or in Ollama, but now Ollama checks the VRAM and writes the model to vRAM, not GTT.
When GTT was still used, I created a custom Modelfile telling ollama to write all layers to video memory; in my case the video memory was the GTT. But now it uses the VRAM instead, so no more GTT tinkering.

I could try an older version of Ollama, but I'd have to take a look at the .sh file beforehand.


@Djip007 commented on GitHub (Oct 21, 2025):

Sorry, I didn't see this issue.
There are a lot of assumptions in the comments that aren't correct; I'll try to be clear.
Ollama doesn't manage any allocations; it only configures llama.cpp.
In the rocm/hip backend of llama.cpp, no allocation is ever requested on the GTT. I don't even think this is possible with HIP. With HIP you can allocate memory on the device or on the host; in the case of host allocation, it's possible to configure cache coherence for better performance.
On the llama.cpp side, there's an option to choose whether to allocate memory on the host or on the device. Enabling this on a dGPU significantly reduces performance, but on an iGPU it allows you to use all of the RAM with very little loss. (And I'd say it can work on Windows.)
What changed is the AMD driver in the Linux kernel. Previously, device allocations always went to vRAM; since many programs don't handle allocation on the host, this severely limited the usable size. In recent kernels (>6.11???), AMD changed the driver so that a device allocation on an APU can land either in vRAM or in GTT. This made it easy to access more memory on laptops where the manufacturer doesn't allow changing the vRAM size in the BIOS.

And there were no changes to llama.cpp/ollama to handle this.

The problem lies with ollama. It doesn't physically allocate memory, but rather tries to estimate the memory size usable by the GPUs to determine how many LLM layers it can fit there. And this is where it gets complicated:

  • On a dGPU, you have to look at the vRAM size; the rest doesn't really matter.
  • On an iGPU, it will depend on how llama.cpp is configured. If it sets the env variable (for recent versions of llama.cpp) `GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON`, then the memory size that llama.cpp can allocate is the RAM size; otherwise, it's vRAM + GTT. (See the sketch below.)
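As a concrete illustration of that switch (the model path is a placeholder, not something from this thread), the variable can be set for a one-off llama.cpp run:

```bash
# Illustrative only: ask the ggml CUDA/HIP backend to use unified memory.
# ./model.gguf is a placeholder; -ngl 99 simply means "offload all layers".
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-cli -m ./model.gguf -ngl 99 -p "why is the sky blue"
```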

Note: I know there's already been some discussion on this topic, but I don't know where it stands.
On the llama.cpp side, they've recently (?) changed the way they handle this allocation: previously, it had to be done at compile time (which was a problem for ollama), but now it's doable via an env variable.


@Djip007 commented on GitHub (Oct 21, 2025):

To finally address the iGPU case (potentially also under Windows), we need to review how the available memory is calculated and how llama.cpp is configured. For a simple case, I would say:

  • dGPU: memory == vRAM
  • iGPU: set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON` and memory == RAM

(And because the vRAM isn't needed at all on an iGPU in that case, it may be best to make it as small as possible...)


@rick-github commented on GitHub (Oct 21, 2025):

ollama is no longer a wrapper for llama.cpp, and it doesn't configure it. It uses the same ggml.org library that llama.cpp does.

`GGML_CUDA_ENABLE_UNIFIED_MEMORY` is not unified memory in the sense that an iGPU and the CPU share it. It enables a dGPU to access system RAM via the PCI bus, and it is not included in the memory estimation done by ollama, except if the layers are forced onto the GPU with `num_gpu`.


@Djip007 commented on GitHub (Oct 22, 2025):

`GGML_CUDA_ENABLE_UNIFIED_MEMORY` is part of ggml, so we can still use it.
It is, as you point out, a way to use RAM on the GPU; the fact that an APU uses unified memory means that, with a good config, it can be used with little or no loss of performance.
For now the CUDA/HIP backend has an inconsistency: when this variable is set, it continues to report the size of the vRAM (+GTT on APUs with a recent kernel) and not the size of the RAM from which it actually allocates.

I have a question: where is the memory computed? Do you use what ggml reports?


@MrUhu commented on GitHub (Nov 7, 2025):

> `GGML_CUDA_ENABLE_UNIFIED_MEMORY` is part of ggml, so we can still use it. It is, as you point out, a way to use RAM on the GPU; the fact that an APU uses unified memory means that, with a good config, it can be used with little or no loss of performance. For now the CUDA/HIP backend has an inconsistency: when this variable is set, it continues to report the size of the vRAM (+GTT on APUs with a recent kernel) and not the size of the RAM from which it actually allocates.
>
> I have a question: where is the memory computed? Do you use what ggml reports?

Thanks for this info. Works for me.

If anyone is also using Fedora, here are my Ollama scripts for Fedora:
https://github.com/MrUhu/handy-fedora-scripts-for-ollama

The update.sh script updates your PC, checks the Ollama release page for the latest release, and only runs the update command when there is a new one. Then it adds a couple of environment variables (primarily GGML_CUDA_ENABLE_UNIFIED_MEMORY=ON and HSA_OVERRIDE_GFX_VERSION=11.0.2) to the ollama.service file and restarts the service. Change the HSA override to your preferred version.
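For context, adding service-level environment variables like that is usually done with a systemd drop-in; a minimal sketch (the values are the ones from the comment above, not recommendations):

```bash
# Illustrative only: add the env vars to ollama.service via a drop-in.
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
#   Environment="HSA_OVERRIDE_GFX_VERSION=11.0.2"
sudo systemctl restart ollama
```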

The change_gtt_size_for_amd_igpu.sh script takes your desired GTT size in GB (!!!), checks your available system memory, and writes the chosen GTT size to grubby. I've added a limiter of 50% of your available system memory; you can edit this out if you want.
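The grubby step would look roughly like this; amdgpu.gttsize is given in MiB, and the 28 GiB value just mirrors the example further down (a sketch, not the script itself):

```bash
# Illustrative only: pin the GTT size to 28 GiB (28672 MiB) via grubby (Fedora).
sudo grubby --update-kernel=ALL --args="amdgpu.gttsize=28672"
sudo reboot
# Verify after the reboot:
cat /sys/module/amdgpu/parameters/gttsize
```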

And finally, overwrite_gpu_restriction_to_modelfiles.sh takes your list of models from ollama ls, looks up their layer count on ollama.com, writes this layer count to a new Modelfile with `PARAMETER num_gpu <layer-count>`, and creates these models for ollama to use.
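The Modelfile trick it automates is plain Ollama syntax; a hand-rolled minimal version looks like this (model name and layer count are placeholders):

```bash
# Illustrative only: clone a model with all of its layers pinned to the GPU.
cat > Modelfile.myqwen3-coder <<'EOF'
FROM qwen3-coder
PARAMETER num_gpu 49
EOF
ollama create myqwen3-coder -f Modelfile.myqwen3-coder
```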

I currently run qwen3-coder (a q4 model, 19 GB total) as "myqwen3-code" on my 7940HS with an RX 780M, with 4 GB of VRAM allocated and all layers running in 28 GB of GTT (the edited models are just called my + model name).

Use at your own risk, of course.


@namecaps3k commented on GitHub (Dec 5, 2025):

I have a similar problem, but my VRAM is set to 1 GB (the lowest value I can set in the BIOS) because I want to use only GTT (the full 120 GB or so). I have ollama installed the official way with the newest ROCm; ollama finds it, but it also sees the low VRAM and loads everything to GTT, or only partially if the model is small.

A user here has exactly the same issue: https://github.com/ollama/ollama/issues/12062

The thing is that all inference is done on the CPU, which is painfully slow. Any idea what I can do to run it on the GPU? llama.cpp runs perfectly fine with this setup. When I set the VRAM in the BIOS to something larger, it shows 100% GPU instead of CPU and works way faster (50 tokens/s vs 20).
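One thing that may be worth trying (my assumption, not something confirmed in this thread): num_gpu can also be forced per request through the API, which sidesteps the VRAM-based layer estimate:

```bash
# Illustrative only: force all layers onto the GPU for a single request.
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_gpu": 99 }
}'
```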

```terminal
ai@minis:~$ ollama -v
ollama version is 0.13.1

ai@minis:~$ ollama run gpt-oss:20b hello; ollama ps

Thinking...
User says "hello". They want greeting? We should respond politely, maybe ask how can help.
...done thinking.

Hello! 👋 How can I help you today?

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    17052f91a42e    14 GB    100% CPU     4096       4 minutes from now
```

Relevant log lines:

```
Nov 30 12:20:35 minis ollama[193626]: time=2025-11-30T12:20:35.977Z level=INFO source=types.go:42 msg="inference compute" id=0 filter_id=0 library=ROCm compute=gfx1151 name=ROCm0 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=60342.13 pci_id=00>
Nov 30 12:20:35 minis ollama[193626]: time=2025-11-30T12:20:35.977Z level=INFO source=routes.go:1638 msg="entering low vram mode" "total vram"="1.0 GiB" threshold="20.0 GiB"
Nov 30 12:21:11 minis ollama[193626]
Nov 30 12:24:32 minis ollama[193626]: Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, ID: 0
```

Reference: github-starred/ollama#33958