[GH-ISSUE #9220] cannot be used with preferred buffer type ROCm_Host, using CPU instead #52521

Closed
opened 2026-04-28 23:36:23 -05:00 by GiteaMirror · 3 comments

Originally created by @ca80000 on GitHub (Feb 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9220

```
llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
```

Environment: Windows
GPU: 7900 XTX
ROCm version: 6.1.2
Model: huihui_ai/deepseek-r1-abliterated:32b

Help, how do I fix this?


@rick-github commented on GitHub (Feb 19, 2025):

It's just a warning, you can ignore it. [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) would confirm, but you are probably spilling from VRAM to system RAM.

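One quick way to confirm whether a model has spilled from VRAM into system RAM, as a minimal sketch assuming a default Windows install of Ollama:

```
# While the model is loaded, list running models; the PROCESSOR column
# reports placement (e.g. "100% GPU" vs. a split like "24%/76% CPU/GPU").
ollama ps
```

Anything other than 100% GPU means part of the model is being served from system RAM.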
Author
Owner

@inkeliz commented on GitHub (Nov 1, 2025):

I'm hitting the same "bug". The performance is a disaster now. Even with `OLLAMA_NUM_PARALLEL=1` and `OLLAMA_CONTEXT_LENGTH=256` and a small 8B Q4 model, it's still showing:

```
llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
```

I'm using a similar card, an RX 7900 XT, which has 20 GB. It shows:

```
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type ROCm_Host, using CPU instead
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        ROCm0 model buffer size =  4403.49 MiB
load_tensors:   CPU_Mapped model buffer size =   281.81 MiB
```
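Worth noting: this log shows `offloaded 33/33 layers to GPU` with only ~282 MiB mapped on the CPU side, so the warning by itself doesn't indicate a spill; it appears to just mean the embedding table was placed in ordinary CPU memory rather than pinned (`ROCm_Host`) host memory. To pull the relevant lines out of a full server log on Windows, a sketch assuming the default log location from the troubleshooting docs:

```
# PowerShell: extract the tensor-offload lines from Ollama's server log.
Select-String -Path "$env:LOCALAPPDATA\Ollama\server.log" -SimpleMatch -Pattern "load_tensors"
```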

@rick-github commented on GitHub (Nov 2, 2025):

[Server log](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.mdx).

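For completeness, one way to capture a verbose server log on Windows, a sketch assuming a default install (`OLLAMA_DEBUG` and the log location are described in the troubleshooting guide linked above):

```
# PowerShell: enable debug logging and run the server in the foreground
# (quit the Ollama tray app first so this instance serves requests).
$env:OLLAMA_DEBUG = "1"
ollama serve

# In a second terminal, reproduce the slow generation, then inspect the log:
Get-Content "$env:LOCALAPPDATA\Ollama\server.log" -Tail 200
```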
Reference: github-starred/ollama#52521