[GH-ISSUE #7966] ggml_cuda_cpy_fn: unsupported type combination (q4_0 to f32) in pre-release version #51609

Closed
opened 2026-04-28 20:38:10 -05:00 by GiteaMirror · 2 comments

Originally created by @dkkb on GitHub (Dec 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7966

What is the issue?

I'm using this model https://huggingface.co/bartowski/Replete-LLM-V2.5-Qwen-32b-GGUF with the v0.5.0 pre-release.
I upgraded to the latest version hoping to see improved performance. However, after making several API calls, I encountered the following error on the client side, and GPU memory usage dropped to 0.

error making request: an error was encountered while running the model:
read tcp 127.0.0.1:3914->127.0.0.1:3890: wsarecv: An existing connection was forcibly closed by the remote host

Environment variables:

set OLLAMA_FLASH_ATTENTION=1
set OLLAMA_KV_CACHE_TYPE=q4_0
set CUDA_VISIBLE_DEVICES=xxx
set OLLAMA_HOST=0.0.0.0:11434
set OLLAMA_ORIGINS=*

Server error log:

ggml_cuda_cpy_fn: unsupported type combination (q4_0 to f32)
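
For context, the failure can be reproduced with roughly the following steps (a sketch only; the model tag and prompt below are illustrative placeholders, not copied from the original setup):

REM in one cmd session: enable flash attention and the quantized KV cache, then start the server
set OLLAMA_FLASH_ATTENTION=1
set OLLAMA_KV_CACHE_TYPE=q4_0
ollama serve

REM in a second terminal: issue a few generate calls until the server log shows the error above
curl http://localhost:11434/api/generate -d "{\"model\": \"hf.co/bartowski/Replete-LLM-V2.5-Qwen-32b-GGUF\", \"prompt\": \"hello\"}"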

This may be the same issue as https://github.com/ggerganov/llama.cpp/issues/5652.

After disabling the OLLAMA_KV_CACHE_TYPE=q4_0 setting, it seems OK now.
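
Concretely, the workaround looks roughly like this (a sketch, assuming the server is restarted from the same cmd session where the variables were set; f16 is believed to be the non-quantized default):

REM revert the KV cache to the default f16 (or clear the variable entirely with: set OLLAMA_KV_CACHE_TYPE=)
set OLLAMA_KV_CACHE_TYPE=f16
REM restart the server so the new setting takes effect
ollama serve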

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

v0.5.0 pre-release

GiteaMirror added the bug label 2026-04-28 20:38:10 -05:00

@rick-github commented on GitHub (Dec 6, 2024):

https://github.com/ollama/ollama/issues/7938


@jessegross commented on GitHub (Dec 7, 2024):

Tracking this in the other bug

Reference: github-starred/ollama#51609