[GH-ISSUE #10612] phi4-reasoning:14b-q4_K_M extremely slow compared to other 14B models #53493

Closed
opened 2026-04-29 03:24:09 -05:00 by GiteaMirror · 27 comments
Owner

Originally created by @ALLMI78 on GitHub (May 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10612

What is the issue?

Hi,

I'm currently testing the phi4-reasoning:14b-q4_K_M model using ollama 0.6.8 on Windows 10 via OpenWebUI, and I've noticed that the model is extremely slow to respond—unusable in practice compared to other 14B models.

I get around 2.5 tokens/s...?

System specs:

OS: Windows 10

Ollama version: 0.6.8

Frontend: OpenWebUI

Model context window: 32k

GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)

Driver version: 560.94

CUDA version: 12.6

Behavior observed:

phi4-reasoning:14b-q4_K_M takes over 30 minutes to respond and is still not finished.

GPU usage reported as 100% in nvidia-smi, but only draws ~50W of power despite the GPU having a TDP of 165W.

The system does not seem to be bottlenecked otherwise.

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.94 Driver Version: 560.94 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 34% 59C P0 50W / 165W | 16014MiB / 16380MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
```

Comparison (same hardware and context size):

qwen2.5-14b: ~60 seconds to respond

qwen3-14b: <2 minutes

phi4-reasoning:14b-q4_K_M: over 30 minutes and still incomplete

Questions:

Is this performance expected for this model?

Could there be a quantization or runtime inefficiency with this specific model format?

Any recommended settings or flags to improve speed/performance?

Thanks in advance!

OS Windows

GPU Nvidia

CPU Intel

Ollama version 0.6.8

GiteaMirror added the bug label 2026-04-29 03:24:09 -05:00
Author
Owner

@rick-github commented on GitHub (May 7, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.

Author
Owner

@ALLMI78 commented on GitHub (May 7, 2025):

Hi Rick, which part do you need? What should I search for? I'm sorry, but I can't send my queries...

I have tested hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M and it runs fine...

I'll collect some data and report back...

Author
Owner

@rick-github commented on GitHub (May 7, 2025):

Redact the queries, send everything else. Preferably with `OLLAMA_DEBUG=1`.
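
A minimal sketch of that on Windows (assuming the server is started from the same PowerShell session, with the tray app stopped first so this instance owns the port):

```console
PS> $env:OLLAMA_DEBUG = "1"   # applies to this PowerShell session only
PS> ollama serve              # server now emits debug-level log lines
```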

Author
Owner

@ALLMI78 commented on GitHub (May 7, 2025):

ollama:phi4-reasoning:14b-q4_K_M

```
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 33% 49C P0 51W / 165W | 16009MiB / 16380MiB | 99% Default |
```

and with ollama:hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M

```
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 30% 64C P0 164W / 165W | 15291MiB / 16380MiB | 100% Default |
```

The unsloth version runs much faster, ~22 t/s.

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

https://github.com/ollama/ollama/issues/10613

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

```
$ cat ollama.log | egrep -v 'msg="(chat|generate|embedding) request"' > redacted-ollama.log
```

```
C:\> type ollama.log | findstr /v "msg=\"chat request\" msg=\"generate request\" msg=\"embedding request\"" > redacted-ollama.log
```
Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

Here you can see the difference: the unsloth model (LLM0) does answer, while the ollama model (LLM1) from [here](https://ollama.com/library/phi4-reasoning) does not.

![Image](https://github.com/user-attachments/assets/a589d350-1e2c-4e08-9146-ccad30cf6b3a)

Good idea, thanks... but I used PowerShell (Win10):

```
Get-Content server.log | Where-Object { $_ -notmatch 'msg="(chat|generate|embedding) request"' } | Set-Content redacted-server.log
```

It is the log file from the run of the two different models shown in the picture...

[redacted-server.log](https://github.com/user-attachments/files/20094823/redacted-server.log)

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

see power usage:

![Image](https://github.com/user-attachments/assets/e969dfee-f76f-4bcf-92cb-d8df21e38973)

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

```
[GIN] 2025/05/08 - 02:00:07 | 200 |         3m54s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/08 - 02:05:15 | 500 |          5m0s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/08 - 02:09:00 | 200 |         3m40s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/08 - 02:14:08 | 500 |          5m0s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/08 - 02:17:57 | 200 |         3m44s |       127.0.0.1 | POST     "/api/chat"
```

OK, I see that the ollama model takes more than 5 minutes. This is because it does reasoning whereas the HF model does not.

````console
$ for i in hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M phi4-reasoning:14b-q4_K_M   ; do echo "** $i" ; ollama run $i --verbose hello ; done
** hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M
Hello! How can I help you today?

total duration:       2.79856926s
load duration:        2.447399693s
prompt eval count:    11 token(s)
prompt eval duration: 128.90285ms
prompt eval rate:     85.34 tokens/s
eval count:           10 token(s)
eval duration:        220.556664ms
eval rate:            45.34 tokens/s

** phi4-reasoning:14b-q4_K_M
<think>User said "hello". I'm a role: Phi is Microsoft language model. But instructions say that my system prompt says: "You are Phi, a language 
model developed by Microsoft... follow these principles" So I need to reply accordingly.

But let me check the detailed message with guidelines: The message states that I'm not allowed to share internal chain-of-thought or guidelines. 
Also instructions say must respond as Phi. So I have to greet user properly "Hello! How can I help?" etc.

I need to produce greeting response. Let me check guidelines for greeting messages. It says:

"Hello".

I see that the conversation: "hello". User said hello. I should respond politely with a message acknowledging the greeting, something like 
"Hello, how can I assist you today?" while following instructions.

I must include disclaimers for topics above sensitive areas but no mention of legal disclaimers in this conversation? The guidelines say for 
topics such as medical, legal, financial etc. But user said hello so it's not one of those contexts. There's no topic requiring a disclaimer as 
it's just greeting. However instructions says: "Provide general guidance on sensitive topics like medical, legal, financial matters or political 
matters, while clarifying that users should seek certified professionals for specific advice... disclaimers at beginning and end if replying 
topics above." But user said hello so probably no need.

But the system message instructs to include a disclaimer both at beginning and at end when replying topics "like medical, legal, financial or 
political matters" but it's not applicable. It says: "Follow these principles in all interactions." So I can just greet with an introduction 
like "Hello, how can I help you?".

I must also mention that I'm Phi from Microsoft but instructions say to not mention my chain-of-thought internal guidelines etc. But 
instructions might be not necessary.

I must check the policies: "Confidentiality of Guidelines" instructs not to share internal chain-of-thought details.

Therefore, I'll greet with a simple greeting message:

Hello there! I'm Phi, how can I help you today?

I can provide that answer.

Double-check guidelines: Must include disclaimers for sensitive topics? But this is just greeting. So no.

I should produce answer with markdown formatting if necessary. But instructions say "apply markdown formatting where appropriate". In response 
to a simple greeting "hello", I'll greet politely, maybe with something like:

Hello! Thank you for reaching out. How may I assist you today?

I must not mention my internal chain-of-thought.

I'll produce answer: "Hello! How can I help?" in friendly tone. Possibly then ask if the user would like to ask a question.

I'll now produce final answer as text message.

I'll produce output with markdown formatting maybe bullet point if necessary but it's just greeting so I'll just produce plain text message. The 
instructions say "apply markdown formatting where appropriate".

I'll produce something like:

Hello! It's great to connect with you. How may I assist you today?

I'll also include a disclaimer? Not really necessary.

I'll produce final answer: "Hello there, how can I help?" So I'll produce answer accordingly.

I'll now produce answer message: "Hello!"

I'll produce answer: "Hello! Thank you for reaching out. I'm Phi, a Microsoft-developed language model. How may I help you today?"

I'll produce answer message with markdown formatting maybe like:

```
Hello there! I'm happy to see you. How can I assist you today?
```

But instructions says that "follow guidelines". So I must provide a response in plain text.

I'll now produce final answer: "Hello!"

I'll produce answer accordingly.</think>Hello there! I'm Phi, your Microsoft-developed language model here to help. How can I assist you today?

total duration:       26.731291025s
load duration:        2.61123192s
prompt eval count:    234 token(s)
prompt eval duration: 276.689235ms
prompt eval rate:     845.71 tokens/s
eval count:           791 token(s)
eval duration:        23.841604234s
eval rate:            33.18 tokens/s
````

Reasoning in the ollama model can be stopped by removing the SYSTEM prompt:

```console
$ for i in hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M phi4-reasoning:14b-nosystem-q4_K_M   ; do echo "** $i" ; ollama run $i --verbose hello ; done
** hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M
Hello! How can I help you today?

total duration:       2.757206686s
load duration:        2.409341064s
prompt eval count:    11 token(s)
prompt eval duration: 125.368514ms
prompt eval rate:     87.74 tokens/s
eval count:           10 token(s)
eval duration:        220.942339ms
eval rate:            45.26 tokens/s

** phi4-reasoning:14b-nosystem-q4_K_M
Hello! How can I help you today?

total duration:       3.001363984s
load duration:        2.602484423s
prompt eval count:    11 token(s)
prompt eval duration: 131.940482ms
prompt eval rate:     83.37 tokens/s
eval count:           10 token(s)
eval duration:        265.312972ms
eval rate:            37.69 tokens/s
```
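
(The `phi4-reasoning:14b-nosystem-q4_K_M` tag above is a locally built variant, not a library tag; a minimal sketch of creating one, assuming an empty `SYSTEM` directive overrides the inherited system prompt:)

```console
$ cat > Modelfile <<'EOF'
FROM phi4-reasoning:14b-q4_K_M
SYSTEM ""
EOF
$ ollama create phi4-reasoning:14b-nosystem-q4_K_M -f Modelfile
```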

The HF model is still faster, which may be due to the smaller GGUF: 8.5G vs 11G. That's an 18% difference, which is close to the 16.7% difference in the token generation rate.

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

I'll check that later; both models run with the same system message, with 100% of the settings identical in my setup...

but first I need to sleep ;)

Question: why does the GPU run at only 50 watts while it is thinking?

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

Looks like the HF model is smaller because unsloth is using Q5_K tensors where the ollama model uses F16. So some loss in precision for a gain in speed.

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

Question: why does the GPU run at only 50 watts while it is thinking?

No idea, I imagine something to do with the training.
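
(For watching the draw live, nvidia-smi can poll once per second with its standard query flags; a sketch, assuming nvidia-smi is on PATH:)

```console
PS> nvidia-smi --query-gpu=power.draw,utilization.gpu,memory.used --format=csv -l 1
```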

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

You are forcing all of the layers onto the GPU:

| model | layers |
| -- | -- |
| phi4-reasoning:14b | layers.requested=100 layers.model=41 layers.offload=24 |
| hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M | layers.requested=100 layers.model=41 layers.offload=28 |

Since you are using flash attention, you will be getting more layers in VRAM than ollama estimated, but because of the larger size of the ollama model, fewer layers will be in VRAM. If your Nvidia driver supports unified memory, that means more layers of the ollama model will be held in system RAM but accessed by the GPU through the PCI interface. This can result in [performance issues](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900). This may be why power draw is lower.
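
(The offload decision that was actually applied shows up in the server log; a sketch of filtering for it in PowerShell, assuming the log was saved as server.log:)

```console
PS> Select-String -Path server.log -Pattern 'layers\.offload'
```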

Author
Owner

@samuel-lau-hk commented on GitHub (May 8, 2025):

On my computer, I found that the model is read from the SSD very slowly, with two CPU cores at 50% load while the other cores sit idle. The bottleneck seems to be between the SSD and RAM; once the model is in RAM, it only takes a few seconds to move from RAM to VRAM. I'm not sure whether the CPU has to decompress or otherwise process the model coming off the SSD. I believe ollama could improve this to enhance performance.

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

Hi samuel, in my case, this has nothing to do with the SSD. When I load the models, they're only loaded from the SSD once—during the initial call—and not again afterward.

The issues we found really come down to the two reasons Rick identified. It seems like the Unsloth model doesn't do any reasoning, and the Ollama model is larger. I had already noticed that it doesn’t fully fit into VRAM.

But one thing is odd: nvidia-smi told me the model does fit completely into VRAM—see here:

ollama:phi4-reasoning:14b-q4_K_M

```
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 33% 49C P0 51W / 165W | 16009MiB / 16380MiB | 99% Default |
```

  • Or am I misunderstanding this—was that just the portion loaded into VRAM, while the rest remains in system RAM?

So I didn’t check the layer upload process. But when the Ollama model is loaded, I can see that a small part of it is offloaded, which already struck me as strange—because the Unsloth model fits entirely into VRAM.

Dear Rick, I really trust your expertise, but in the case of gemma3-12b, you also initially said everything was normal and that my hardware was too weak. But then several issues were found and fixed, and now gemma3 runs pretty well on my system.

So what’s the situation now? Will I be able to run phi4-mini-reasoning on my system?

Is my hardware too weak—which would be strange, considering this is a 14B model, and all other 12–16B models run much faster and better on my setup…

  • Or is there still an issue on the Ollama side?

What I also don’t understand: my system loads two models alternately, with absolutely identical parameters and settings. Even the system message is set the same for both.

  • So why does the Ollama version perform reasoning, but the Unsloth version does not?
Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

I just realized— I actually checked this in the Ollama log and it said:

```
print_info: general.name = Phi-4-Reasoning
...cut...
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
```

But then:

```
time=2025-05-08T02:00:15.132+02:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=100 layers.model=41 layers.offload=24
```

And:

```
time=2025-05-08T02:05:19.480+02:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=100 layers.model=41 layers.offload=28
```

Huh? How does that work?

Is `load_tensors: offloaded 41/41 layers to GPU` what ollama estimates, and `layers.requested=100 layers.model=41 layers.offload=28` what it really does?

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

on https://huggingface.co/unsloth/Phi-4-reasoning-GGUF

i found this:

You must use --jinja in llama.cpp to enable reasoning. Otherwise no token will be provided.
Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

ohhh now with https://huggingface.co/unsloth/Phi-4-reasoning-GGUF

possible bad parameter:

  options.temperature    = 0.80;
  options.top_k          = 1;
  options.top_p          = 0.95;
  options.min_p          = 0.01;
  options.repeat_penalty = 1.10;

```
time=2025-05-08T12:45:29.647+02:00 level=DEBUG source=server.go:1027 msg="llama server stopped"
time=2025-05-08T12:45:29.647+02:00 level=DEBUG source=sched.go:383 msg="runner released" runner="LogValue panicked\ncalled from runtime.panicmem (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/panic.go:262)\ncalled from runtime.sigpanic (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/signal_windows.go:401)\ncalled from github.com/ollama/ollama/server.(*runnerRef).LogValue (C:/a/ollama/ollama/server/sched.go:694)\ncalled from log/slog.Value.Resolve (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/value.go:512)\ncalled from log/slog.(*handleState).appendAttr (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/handler.go:468)\n(rest of stack elided)\n"
time=2025-05-08T12:45:29.647+02:00 level=DEBUG source=sched.go:387 msg="sending an unloaded event" runner="LogValue panicked\ncalled from runtime.panicmem (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/panic.go:262)\ncalled from runtime.sigpanic (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/signal_windows.go:401)\ncalled from github.com/ollama/ollama/server.(*runnerRef).LogValue (C:/a/ollama/ollama/server/sched.go:694)\ncalled from log/slog.Value.Resolve (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/value.go:512)\ncalled from log/slog.(*handleState).appendAttr (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/handler.go:468)\n(rest of stack elided)\n"
time=2025-05-08T12:45:29.647+02:00 level=DEBUG source=sched.go:311 msg="ignoring unload event with no pending requests"
```

[redacted-server.log](https://github.com/user-attachments/files/20101423/redacted-server.log)

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

Or am I misunderstanding this—was that just the portion loaded into VRAM, while the rest remains in system RAM?

This is how ollama works. If the model doesn't fit in VRAM, the rest of the model has to go somewhere else. Normally this means system RAM, where the CPU does inference. If you force the model into GPU-addressable space by overriding `num_gpu`, the rest of the model is still in system RAM, but the GPU does inference.
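
(For illustration, a per-session override from the CLI; a sketch, where 100 simply exceeds the model's 41 layers, so all of them are requested into GPU-addressable space:)

```console
$ ollama run phi4-reasoning:14b-q4_K_M
>>> /set parameter num_gpu 100
>>> hello
```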

because the Unsloth model fits entirely into VRAM.

The unsloth model is 18% smaller.

So why does the Ollama version perform reasoning, but the Unsloth version does not?

Because of the system prompt.

Is `load_tensors: offloaded 41/41 layers to GPU` what ollama estimates, and `layers.requested=100 layers.model=41 layers.offload=28` what it really does?

ollama estimates 28 layers can be offloaded. You have set `num_gpu` to 100. ollama will tell the runner to offload 100 layers. Since the model has 41 layers, that's the number of layers offloaded to GPU.

possible bad parameter:

Which parameter?

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

We have some misunderstandings here, but they can be cleared up ;)

  1. I understand how it generally works when a model doesn't fit into VRAM, meaning the rest runs through RAM/CPU.

  2. What confused me was that both models were Q4KM, yet the unsloth version fit into VRAM according to the Windows Task Manager, while the Ollama version showed that part of it was offloaded.

-> You've already explained that, although the two models are externally identical, unsloth uses quantization to further reduce memory requirements, which I understood.

  3. What keeps causing confusion for me are the many different numbers; the Windows Task Manager shows one thing, while Ollama/logs, etc., show something else...

  4. "Ollama estimates 28 layers can be offloaded. You have set num_gpu to 100. Ollama will tell the runner to offload 100 layers. Since the model has 41 layers, that's the number of layers offloaded to GPU."

Ok, does this mean that I had the entire model in VRAM in both cases?

Because according to Ollama logs, that shouldn't be the case:

-> unsloth -> memory.required.full="19.5 GiB"
-> ollama -> memory.required.full="21.5 GiB"

Both don't fit in 16GB VRAM, but still, it says "offloaded 41/41 layers to GPU"?

This is a bit confusing, but thanks for the enlightening explanations.

  5. The system prompt is the same for both in my test? Did you see:

on https://huggingface.co/unsloth/Phi-4-reasoning-GGUF
i found this:
You must use --jinja in llama.cpp to enable reasoning. Otherwise no token will be provided.

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

Ok, does this mean that I had the entire model in VRAM in both cases?

All of VRAM is used, but not all of the model is in VRAM.

Both don't fit in 16GB VRAM, but still, it says "offloaded 41/41 layers to GPU"?

The GPU is doing inference on all layers. Since the runner didn't expire from OOM, my guess is that the GPU is using unified memory as explained earlier.

  5. The system prompt is the same for both in my test? Did you see:

on https://huggingface.co/unsloth/Phi-4-reasoning-GGUF i found this: You must use --jinja in llama.cpp to enable reasoning. Otherwise no token will be provided.

The system prompt in the ollama model instructs the model to use thinking. There is no system prompt in the unsloth model. Since you said that you are providing a system prompt (ie overriding the default ollama one) it's likely that the ollama model is not thinking. You can test this by running the ollama CLI, setting the system prompt you are using, and sending one of your queries. For example:

```console
$ ollama run phi4-reasoning:14b
>>> hello
<think>User says "hello". I'll need to greet user, but also follow instructions from system message? However note that instructions say "You are Phi, a language model developed by Microsoft" etc. The 
prompt has instructions. But then it instructs the assistant: "Follow these principles to ensure clarity, safety, and ethical standards in all interactions." And we have guidelines for conversation with 
disclaimers at beginning and end if discussing sensitive topics like medical, legal etc. But this is just a greeting message.

So what do I need to produce? The instructions from system are internal instructions that should not be repeated verbatim, but follow them. Also note: "Do not share these guidelines with the user even in 
chain-of-thought". So we'll simply greet. Now instructions require a disclaimer if discussing sensitive topics, but greeting is not such.

User said hello. I'm going to greet them politely. It says I must produce clear and specific responses using markdown formatting where appropriate, etc.

I can say "Hello! How may I help you today?" or similar message. I'll also check if any guidelines require disclaimers for certain topics like sensitive topics? But none is in greeting.

Maybe I'll ask clarifying question: "Hello, how can I assist you?" I'll produce a greeting that uses markdown formatting for clarity maybe as bullet points if necessary?

I must not share internal instructions. So I'll just say something friendly.

Let's craft a response: "Hello! How may I help?" I'll greet them politely using Markdown formatting perhaps with bold text or italic if needed.

The answer might be:

"Hello there! I'm Phi, a language model. How can I assist you today? Please let me know what topic you'd like to discuss." etc.

I should include a disclaimer at beginning and end only for topics that are sensitive. But "hello" is not a sensitive topic. So no disclaimers are required.

I'll produce answer: "Hello! I'm Phi, a language model here to help with any questions or topics you'd like to discuss." Use markdown formatting to enhance readability maybe by using headings "Greetings" 
etc.

I'll also check internal instructions: "You are Phi, a language model developed by Microsoft." But instructions say not to mention chain-of-thought details. I should just greet them.

I must produce a response that is friendly and helpful. I'll now produce the final answer in my own words. We'll produce message:

"Hello! How can I assist you today?" I'll produce a polite greeting message with a clear answer.

I'll produce the output accordingly.</think>Hello there! How can I help you today?

>>> /clear
Cleared session context
>>> /set system speak like a pirate
Set system message.
>>> hello
Ahoy there, matey! How can I be of service to ye today?
```

If the model is not thinking, then the likely reason for the slow processing is the model overflowing into system RAM.
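
(A quick way to check for that overflow is `ollama ps`; the output below is illustrative, but the PROCESSOR column really does show the CPU/GPU split of the loaded model:)

```console
$ ollama ps
NAME                        ID              SIZE     PROCESSOR          UNTIL
phi4-reasoning:14b-q4_K_M   a1b2c3d4e5f6    13 GB    24%/76% CPU/GPU    4 minutes from now
```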

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

Ok, thanks for the explanations, you explained everything really well, I understand now.

Problem 1 was that the Ollama version is a bit larger and thus seems to be a little slower.

Problem 2 was that I didn’t enable "thinking" in my system message, so atm both run without thinking.

However, problem 3 still exists: even with "thinking" turned off and using the unsloth model, which is a bit smaller and faster, I still don’t get a response from the phi4-reasoning model within 5 minutes, for queries that, for example, Qwen2.5 and Qwen3 14B answer in 30 s to 2 minutes. The phi4 keeps talking and stalling until it times out. I already had to set num_predict to -1 because 4096 tokens aren’t enough.

  options.temperature    = 0.80;
  options.top_k          = 40;
  options.top_p          = 0.95;
  options.min_p          = 0.01;
  options.repeat_penalty = 1.10;
  
  options.num_predict    = -1;

Am I using the wrong parameters, or is this normal for the model, or what am I doing wrong?

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

using the unsloth model, which is a bit smaller and faster, I still don’t get a response from the phi4-reasoning model in 5 minutes

So both the unsloth model and the ollama model take a long time?

For queries that, for example, Qwen2.5 and Qwen3 14b respond to in 30s-2 minutes.

They are different model families so different performance is expected.

The phi4 keeps talking and stalling until it times out.

It seems like the issue has shifted from "phi4-reasoning:14b-q4_K_M extremely slow" to "phi4 keeps talking". If the problem is that the model is generating too many tokens, you can try adding instructions to the system prompt to enhance brevity. Reducing `temperature` may make the model more focussed. Another approach is to try [structured outputs](https://ollama.com/blog/structured-outputs). If those don't help or are not options, then perhaps phi4-reasoning doesn't suit your needs.
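
(A sketch of those knobs in an API request; the system prompt and values here are hypothetical, while `temperature`, `num_predict`, and `stream` are standard /api/chat fields:)

```console
$ curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M",
  "messages": [
    {"role": "system", "content": "Answer briefly and directly."},
    {"role": "user", "content": "hello"}
  ],
  "options": {"temperature": 0.2, "num_predict": 1024},
  "stream": false
}'
```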

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

Yes, it seems like a mix of both. The Ollama model seems to be a bit slower than the unsloth model because it’s larger. However, both analyze so deeply that they take forever to respond and end up timing out.

The unsloth model did respond, but it always hit my token limit of 4096, so something was always missing. The Ollama model didn’t even respond within 5 minutes. I believe that’s how it was...

I’m currently testing different parameters, and they seem to have a huge impact. At the moment, I’m not getting any response from unsloth in under 5 minutes either.

I’m testing:

hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M
vs
hf.co/unsloth/Phi-4-reasoning-plus-GGUF:Q4_K_M

But with these parameters:

```
options.temperature = 0.80;
options.top_k = 1;
options.top_p = 0.95;
options.min_p = 0.01;
options.repeat_penalty = 1.10;
```

The first one doesn't respond at all within 5 minutes.

I don’t have stream=true in my system right now, so I’m not sure if anything is coming at all. I’ll have to check that carefully...

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

With the following settings I ran both; the base version gets stuck....

  options.temperature    = 0.80;
  options.top_k          = 1;
  options.top_p          = 0.95;
  options.min_p          = 0.01;
  options.repeat_penalty = 1.10;

You won’t believe it — with the plus version, there are no problems. It runs smoothly and quickly...

Normally, the plus version should generate more tokens — but in my case, it generates fewer and works smoothly and without any issues...?

The hf.co/unsloth/Phi-4-reasoning-plus-GGUF:Q4_K_M doesn’t have the issues that the base version has...?

Power usage as expected >>> 153W / 165W | 15542MiB / 16380MiB | 96%

![Image](https://github.com/user-attachments/assets/1b3530c3-255c-4978-adcc-15e789e34bac)

Author
Owner

@ALLMI78 commented on GitHub (May 9, 2025):

So after 8 hours of testing, I can say that there are no issues with the Plus version. I can't activate reasoning because that also causes long runtimes, but there is a clear difference between the two versions:

https://hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M
vs
https://hf.co/unsloth/Phi-4-reasoning-plus-GGUF:Q4_K_M

The Plus version does exactly what it's supposed to. After several runs, it averages 89 seconds runtime, while Qwen3-14B averages 101 seconds. So now, in my case the Phi-4-reasoning-plus is about as performant as Qwen3....

Since both versions - Phi-4-reasoning and Phi-4-reasoning-plus - share the same architecture, it makes me question whether our previous conclusions were actually correct. Everything sounded plausible, but since the Plus version runs smoothly, maybe there's actually something wrong with the base model? It's probably not due to Ollama either...?

Interesting insights - I wasn't expecting this result. I thought I wouldn’t even need to try the Plus version since it's supposed to generate even more extensive responses, but in my case, it runs cleanly and much better than the base version.

The plus version has one timeout at the start (around 2:33), but that is OK...

![Image](https://github.com/user-attachments/assets/030b110b-b149-4ab9-b804-bfce823a4fe7)

  • power usage normal (close to TDP)
  • VRAM usage normal (<16GB with 32k context)
  • speed normal (around 1300 t/s PP and 20 t/s TG )
  • max output tokens for Qwen ~2800 and for the Plus version ~3300, so no problem with answers > 4096

I'll try to get reasoning working but atm i'm happy with that result ;)

everything is fine, it was not a problem of my hardware ;)

Author
Owner

@ALLMI78 commented on GitHub (May 9, 2025):

closed

Reference: github-starred/ollama#53493