[GH-ISSUE #10612] phi4-reasoning:14b-q4_K_M extremely slow compared to other 14B models #53493

Closed
opened 2026-04-29 03:24:09 -05:00 by GiteaMirror · 27 comments
Owner

Originally created by @ALLMI78 on GitHub (May 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10612

What is the issue?

Hi,

I'm currently testing the phi4-reasoning:14b-q4_K_M model using ollama 0.6.8 on Windows 10 via OpenWebUI, and I've noticed that the model is extremely slow to respond—unusable in practice compared to other 14B models.

I get around 2.5 tokens/s...?

System specs:

OS: Windows 10

Ollama version: 0.6.8

Frontend: OpenWebUI

Model context window: 32k

GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)

Driver version: 560.94

CUDA version: 12.6

Behavior observed:

phi4-reasoning:14b-q4_K_M takes over 30 minutes to respond and is still not finished.

GPU usage reported as 100% in nvidia-smi, but only draws ~50W of power despite the GPU having a TDP of 165W.

The system does not seem to be bottlenecked otherwise.

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.94 Driver Version: 560.94 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 34% 59C P0 50W / 165W | 16014MiB / 16380MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
```

Comparison (same hardware and context size):

qwen2.5-14b: ~60 seconds to respond

qwen3-14b: <2 minutes

phi4-reasoning:14b-q4_K_M: over 30 minutes and still incomplete

Questions:

Is this performance expected for this model?

Could there be a quantization or runtime inefficiency with this specific model format?

Any recommended settings or flags to improve speed/performance?

Thanks in advance!

OS Windows

GPU Nvidia

CPU Intel

Ollama version 0.6.8

GiteaMirror added the bug label 2026-04-29 03:24:09 -05:00
Author
Owner

@rick-github commented on GitHub (May 7, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.

Author
Owner

@ALLMI78 commented on GitHub (May 7, 2025):

Hi Rick, which part do you need? What should I search for? I'm sorry, but I can't send my queries...

I have tested hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M and it runs fine...

I'll collect some data and report back...

Author
Owner

@rick-github commented on GitHub (May 7, 2025):

Redact the queries, send everything else. Preferably with `OLLAMA_DEBUG=1`.
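
A minimal sketch of that on Windows (assuming the server is started from the same PowerShell session, with the tray app stopped first so this instance owns the port):

```console
PS> $env:OLLAMA_DEBUG = "1"   # applies to this PowerShell session only
PS> ollama serve              # server now emits debug-level log lines
```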

Author
Owner

@ALLMI78 commented on GitHub (May 7, 2025):

ollama:phi4-reasoning:14b-q4_K_M

```
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 33% 49C P0 51W / 165W | 16009MiB / 16380MiB | 99% Default |
```

and with ollama:hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M

```
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 30% 64C P0 164W / 165W | 15291MiB / 16380MiB | 100% Default |
```

The unsloth version runs much faster, ~22 t/s.

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

https://github.com/ollama/ollama/issues/10613

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

```
$ cat ollama.log | egrep -v 'msg="(chat|generate|embedding) request"' > redacted-ollama.log
```

```
C:\> type ollama.log | findstr /v "msg=\"chat request\" msg=\"generate request\" msg=\"embedding request\"" > redacted-ollama.log
```
Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

Here you can see the difference: the unsloth model (LLM0) does answer, while the ollama model (LLM1) from [here](https://ollama.com/library/phi4-reasoning) does not.

![Image](https://github.com/user-attachments/assets/a589d350-1e2c-4e08-9146-ccad30cf6b3a)

Good idea, thanks... but I used PowerShell (Win10):

```
Get-Content server.log | Where-Object { $_ -notmatch 'msg="(chat|generate|embedding) request"' } | Set-Content redacted-server.log
```

It is the log file from the run of the two different models shown in the picture...

[redacted-server.log](https://github.com/user-attachments/files/20094823/redacted-server.log)

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

see power usage:

![Image](https://github.com/user-attachments/assets/e969dfee-f76f-4bcf-92cb-d8df21e38973)

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

```
[GIN] 2025/05/08 - 02:00:07 | 200 |         3m54s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/08 - 02:05:15 | 500 |          5m0s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/08 - 02:09:00 | 200 |         3m40s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/08 - 02:14:08 | 500 |          5m0s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/08 - 02:17:57 | 200 |         3m44s |       127.0.0.1 | POST     "/api/chat"
```

OK, I see that the ollama model takes more than 5 minutes. This is because it does reasoning whereas the HF model does not.

````console
$ for i in hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M phi4-reasoning:14b-q4_K_M   ; do echo "** $i" ; ollama run $i --verbose hello ; done
** hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M
Hello! How can I help you today?

total duration:       2.79856926s
load duration:        2.447399693s
prompt eval count:    11 token(s)
prompt eval duration: 128.90285ms
prompt eval rate:     85.34 tokens/s
eval count:           10 token(s)
eval duration:        220.556664ms
eval rate:            45.34 tokens/s

** phi4-reasoning:14b-q4_K_M
<think>User said "hello". I'm a role: Phi is Microsoft language model. But instructions say that my system prompt says: "You are Phi, a language 
model developed by Microsoft... follow these principles" So I need to reply accordingly.

But let me check the detailed message with guidelines: The message states that I'm not allowed to share internal chain-of-thought or guidelines. 
Also instructions say must respond as Phi. So I have to greet user properly "Hello! How can I help?" etc.

I need to produce greeting response. Let me check guidelines for greeting messages. It says:

"Hello".

I see that the conversation: "hello". User said hello. I should respond politely with a message acknowledging the greeting, something like 
"Hello, how can I assist you today?" while following instructions.

I must include disclaimers for topics above sensitive areas but no mention of legal disclaimers in this conversation? The guidelines say for 
topics such as medical, legal, financial etc. But user said hello so it's not one of those contexts. There's no topic requiring a disclaimer as 
it's just greeting. However instructions says: "Provide general guidance on sensitive topics like medical, legal, financial matters or political 
matters, while clarifying that users should seek certified professionals for specific advice... disclaimers at beginning and end if replying 
topics above." But user said hello so probably no need.

But the system message instructs to include a disclaimer both at beginning and at end when replying topics "like medical, legal, financial or 
political matters" but it's not applicable. It says: "Follow these principles in all interactions." So I can just greet with an introduction 
like "Hello, how can I help you?".

I must also mention that I'm Phi from Microsoft but instructions say to not mention my chain-of-thought internal guidelines etc. But 
instructions might be not necessary.

I must check the policies: "Confidentiality of Guidelines" instructs not to share internal chain-of-thought details.

Therefore, I'll greet with a simple greeting message:

Hello there! I'm Phi, how can I help you today?

I can provide that answer.

Double-check guidelines: Must include disclaimers for sensitive topics? But this is just greeting. So no.

I should produce answer with markdown formatting if necessary. But instructions say "apply markdown formatting where appropriate". In response 
to a simple greeting "hello", I'll greet politely, maybe with something like:

Hello! Thank you for reaching out. How may I assist you today?

I must not mention my internal chain-of-thought.

I'll produce answer: "Hello! How can I help?" in friendly tone. Possibly then ask if the user would like to ask a question.

I'll now produce final answer as text message.

I'll produce output with markdown formatting maybe bullet point if necessary but it's just greeting so I'll just produce plain text message. The 
instructions say "apply markdown formatting where appropriate".

I'll produce something like:

Hello! It's great to connect with you. How may I assist you today?

I'll also include a disclaimer? Not really necessary.

I'll produce final answer: "Hello there, how can I help?" So I'll produce answer accordingly.

I'll now produce answer message: "Hello!"

I'll produce answer: "Hello! Thank you for reaching out. I'm Phi, a Microsoft-developed language model. How may I help you today?"

I'll produce answer message with markdown formatting maybe like:

```
Hello there! I'm happy to see you. How can I assist you today?
```

But instructions says that "follow guidelines". So I must provide a response in plain text.

I'll now produce final answer: "Hello!"

I'll produce answer accordingly.</think>Hello there! I'm Phi, your Microsoft-developed language model here to help. How can I assist you today?

total duration:       26.731291025s
load duration:        2.61123192s
prompt eval count:    234 token(s)
prompt eval duration: 276.689235ms
prompt eval rate:     845.71 tokens/s
eval count:           791 token(s)
eval duration:        23.841604234s
eval rate:            33.18 tokens/s
````

Reasoning in the ollama model can be stopped by removing the SYSTEM prompt:

```console
$ for i in hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M phi4-reasoning:14b-nosystem-q4_K_M   ; do echo "** $i" ; ollama run $i --verbose hello ; done
** hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M
Hello! How can I help you today?

total duration:       2.757206686s
load duration:        2.409341064s
prompt eval count:    11 token(s)
prompt eval duration: 125.368514ms
prompt eval rate:     87.74 tokens/s
eval count:           10 token(s)
eval duration:        220.942339ms
eval rate:            45.26 tokens/s

** phi4-reasoning:14b-nosystem-q4_K_M
Hello! How can I help you today?

total duration:       3.001363984s
load duration:        2.602484423s
prompt eval count:    11 token(s)
prompt eval duration: 131.940482ms
prompt eval rate:     83.37 tokens/s
eval count:           10 token(s)
eval duration:        265.312972ms
eval rate:            37.69 tokens/s
```
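
(The `phi4-reasoning:14b-nosystem-q4_K_M` tag above is a locally built variant, not a library tag; a minimal sketch of creating one, assuming an empty `SYSTEM` directive overrides the inherited system prompt:)

```console
$ cat > Modelfile <<'EOF'
FROM phi4-reasoning:14b-q4_K_M
SYSTEM ""
EOF
$ ollama create phi4-reasoning:14b-nosystem-q4_K_M -f Modelfile
```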

The HF model is still faster, which may be due to the smaller GGUF: 8.5G vs 11G. That's an 18% difference, which is close to the 16.7% difference in the token generation rate.

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

I'll check that later; both models run with the same system message, with 100% of the settings identical in my setup...

but first I need to sleep ;)

Question: why does the GPU run at only 50 watts while it is thinking?

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

Looks like the HF model is smaller because unsloth is using Q5_K tensors where the ollama model uses F16. So some loss in precision for a gain in speed.

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

Question: why does the GPU run at only 50 watts while it is thinking?

No idea, I imagine something to do with the training.
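
(For watching the draw live, nvidia-smi can poll once per second with its standard query flags; a sketch, assuming nvidia-smi is on PATH:)

```console
PS> nvidia-smi --query-gpu=power.draw,utilization.gpu,memory.used --format=csv -l 1
```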

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

You are forcing all of the layers onto the GPU:

| model | layers |
| -- | -- |
| phi4-reasoning:14b | layers.requested=100 layers.model=41 layers.offload=24 |
| hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M | layers.requested=100 layers.model=41 layers.offload=28 |

Since you are using flash attention, you will be getting more layers in VRAM than ollama estimated, but because of the larger size of the ollama model, fewer layers will be in VRAM. If your Nvidia driver supports unified memory, that means more layers of the ollama model will be held in system RAM but accessed by the GPU through the PCI interface. This can result in [performance issues](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900). This may be why power draw is lower.
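
(The offload decision that was actually applied shows up in the server log; a sketch of filtering for it in PowerShell, assuming the log was saved as server.log:)

```console
PS> Select-String -Path server.log -Pattern 'layers\.offload'
```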

Author
Owner

@samuel-lau-hk commented on GitHub (May 8, 2025):

On my computer, I found that the model is read from the SSD very slowly, with two CPU cores at 50% load while the other cores sit idle. The bottleneck seems to be between the SSD and RAM; once the model is in RAM, it only takes a few seconds to move from RAM to VRAM. I'm not sure whether the CPU has to decompress or otherwise process the model coming off the SSD. I believe ollama could improve this to enhance performance.

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

Hi samuel, in my case, this has nothing to do with the SSD. When I load the models, they're only loaded from the SSD once—during the initial call—and not again afterward.

The issues we found really come down to the two reasons Rick identified. It seems like the Unsloth model doesn't do any reasoning, and the Ollama model is larger. I had already noticed that it doesn’t fully fit into VRAM.

But one thing is odd: nvidia-smi told me the model does fit completely into VRAM—see here:

ollama:phi4-reasoning:14b-q4_K_M

```
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 33% 49C P0 51W / 165W | 16009MiB / 16380MiB | 99% Default |
```

  • Or am I misunderstanding this—was that just the portion loaded into VRAM, while the rest remains in system RAM?

So I didn’t check the layer upload process. But when the Ollama model is loaded, I can see that a small part of it is offloaded, which already struck me as strange—because the Unsloth model fits entirely into VRAM.

Dear Rick, I really trust your expertise, but in the case of gemma3-12b, you also initially said everything was normal and that my hardware was too weak. But then several issues were found and fixed, and now gemma3 runs pretty well on my system.

So what’s the situation now? Will I be able to run phi4-mini-reasoning on my system?

Is my hardware too weak—which would be strange, considering this is a 14B model, and all other 12–16B models run much faster and better on my setup…

  • Or is there still an issue on the Ollama side?

What I also don’t understand: my system loads two models alternately, with absolutely identical parameters and settings. Even the system message is set the same for both.

  • So why does the Ollama version perform reasoning, but the Unsloth version does not?
Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

I just realized— I actually checked this in the Ollama log and it said:

```
print_info: general.name = Phi-4-Reasoning
...cut...
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
```

But then:

```
time=2025-05-08T02:00:15.132+02:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=100 layers.model=41 layers.offload=24
```

And:

```
time=2025-05-08T02:05:19.480+02:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=100 layers.model=41 layers.offload=28
```

Huh? How does that work?

Is `load_tensors: offloaded 41/41 layers to GPU` what ollama estimates, and `layers.requested=100 layers.model=41 layers.offload=28` what it really does?

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

on https://huggingface.co/unsloth/Phi-4-reasoning-GGUF

i found this:

You must use --jinja in llama.cpp to enable reasoning. Otherwise no token will be provided.
Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

ohhh now with https://huggingface.co/unsloth/Phi-4-reasoning-GGUF

possible bad parameter:

  options.temperature    = 0.80;
  options.top_k          = 1;
  options.top_p          = 0.95;
  options.min_p          = 0.01;
  options.repeat_penalty = 1.10;

```
time=2025-05-08T12:45:29.647+02:00 level=DEBUG source=server.go:1027 msg="llama server stopped"
time=2025-05-08T12:45:29.647+02:00 level=DEBUG source=sched.go:383 msg="runner released" runner="LogValue panicked\ncalled from runtime.panicmem (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/panic.go:262)\ncalled from runtime.sigpanic (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/signal_windows.go:401)\ncalled from github.com/ollama/ollama/server.(*runnerRef).LogValue (C:/a/ollama/ollama/server/sched.go:694)\ncalled from log/slog.Value.Resolve (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/value.go:512)\ncalled from log/slog.(*handleState).appendAttr (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/handler.go:468)\n(rest of stack elided)\n"
time=2025-05-08T12:45:29.647+02:00 level=DEBUG source=sched.go:387 msg="sending an unloaded event" runner="LogValue panicked\ncalled from runtime.panicmem (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/panic.go:262)\ncalled from runtime.sigpanic (C:/hostedtoolcache/windows/go/1.24.0/x64/src/runtime/signal_windows.go:401)\ncalled from github.com/ollama/ollama/server.(*runnerRef).LogValue (C:/a/ollama/ollama/server/sched.go:694)\ncalled from log/slog.Value.Resolve (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/value.go:512)\ncalled from log/slog.(*handleState).appendAttr (C:/hostedtoolcache/windows/go/1.24.0/x64/src/log/slog/handler.go:468)\n(rest of stack elided)\n"
time=2025-05-08T12:45:29.647+02:00 level=DEBUG source=sched.go:311 msg="ignoring unload event with no pending requests"
```

[redacted-server.log](https://github.com/user-attachments/files/20101423/redacted-server.log)

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

Or am I misunderstanding this—was that just the portion loaded into VRAM, while the rest remains in system RAM?

This is how ollama works. If the model doesn't fit in VRAM, the rest of the model has to go somewhere else. Normally this means system RAM, where the CPU does inference. If you force the model into GPU-addressable space by overriding `num_gpu`, the rest of the model is still in system RAM, but the GPU does inference.
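
(For illustration, a per-session override from the CLI; a sketch, where 100 simply exceeds the model's 41 layers, so all of them are requested into GPU-addressable space:)

```console
$ ollama run phi4-reasoning:14b-q4_K_M
>>> /set parameter num_gpu 100
>>> hello
```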

because the Unsloth model fits entirely into VRAM.

The unsloth model is 18% smaller.

So why does the Ollama version perform reasoning, but the Unsloth version does not?

Because of the system prompt.

Is `load_tensors: offloaded 41/41 layers to GPU` what ollama estimates, and `layers.requested=100 layers.model=41 layers.offload=28` what it really does?

ollama estimates 28 layers can be offloaded. You have set `num_gpu` to 100. ollama will tell the runner to offload 100 layers. Since the model has 41 layers, that's the number of layers offloaded to GPU.

possible bad parameter:

Which parameter?

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

We have some misunderstandings here, but they can be cleared up ;)

  1. I understand how it generally works when a model doesn't fit into VRAM, meaning the rest runs through RAM/CPU.

  2. What confused me was that both models were Q4KM, yet the unsloth version fit into VRAM according to the Windows Task Manager, while the Ollama version showed that part of it was offloaded.

-> You've already explained that, although the two models are externally identical, unsloth uses quantization to further reduce memory requirements, which I understood.

  3. What keeps causing confusion for me are the many different numbers; the Windows Task Manager shows one thing, while Ollama/logs, etc., show something else...

  4. "Ollama estimates 28 layers can be offloaded. You have set num_gpu to 100. Ollama will tell the runner to offload 100 layers. Since the model has 41 layers, that's the number of layers offloaded to GPU."

Ok, does this mean that I had the entire model in VRAM in both cases?

Because according to Ollama logs, that shouldn't be the case:

-> unsloth -> memory.required.full="19.5 GiB"
-> ollama -> memory.required.full="21.5 GiB"

Both don't fit in 16GB VRAM, but still, it says "offloaded 41/41 layers to GPU"?

This is a bit confusing, but thanks for the enlightening explanations.

  5. The system prompt is the same for both in my test? Did you see:

on https://huggingface.co/unsloth/Phi-4-reasoning-GGUF
i found this:
You must use --jinja in llama.cpp to enable reasoning. Otherwise no token will be provided.

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

Ok, does this mean that I had the entire model in VRAM in both cases?

All of VRAM is used, but not all of the model is in VRAM.

Both don't fit in 16GB VRAM, but still, it says "offloaded 41/41 layers to GPU"?

The GPU is doing inference on all layers. Since the runner didn't expire from OOM, my guess is that the GPU is using unified memory as explained earlier.

  5. The system prompt is the same for both in my test? Did you see:

on https://huggingface.co/unsloth/Phi-4-reasoning-GGUF i found this: You must use --jinja in llama.cpp to enable reasoning. Otherwise no token will be provided.

The system prompt in the ollama model instructs the model to use thinking. There is no system prompt in the unsloth model. Since you said that you are providing a system prompt (ie overriding the default ollama one) it's likely that the ollama model is not thinking. You can test this by running the ollama CLI, setting the system prompt you are using, and sending one of your queries. For example:

```console
$ ollama run phi4-reasoning:14b
>>> hello
<think>User says "hello". I'll need to greet user, but also follow instructions from system message? However note that instructions say "You are Phi, a language model developed by Microsoft" etc. The 
prompt has instructions. But then it instructs the assistant: "Follow these principles to ensure clarity, safety, and ethical standards in all interactions." And we have guidelines for conversation with 
disclaimers at beginning and end if discussing sensitive topics like medical, legal etc. But this is just a greeting message.

So what do I need to produce? The instructions from system are internal instructions that should not be repeated verbatim, but follow them. Also note: "Do not share these guidelines with the user even in 
chain-of-thought". So we'll simply greet. Now instructions require a disclaimer if discussing sensitive topics, but greeting is not such.

User said hello. I'm going to greet them politely. It says I must produce clear and specific responses using markdown formatting where appropriate, etc.

I can say "Hello! How may I help you today?" or similar message. I'll also check if any guidelines require disclaimers for certain topics like sensitive topics? But none is in greeting.

Maybe I'll ask clarifying question: "Hello, how can I assist you?" I'll produce a greeting that uses markdown formatting for clarity maybe as bullet points if necessary?

I must not share internal instructions. So I'll just say something friendly.

Let's craft a response: "Hello! How may I help?" I'll greet them politely using Markdown formatting perhaps with bold text or italic if needed.

The answer might be:

"Hello there! I'm Phi, a language model. How can I assist you today? Please let me know what topic you'd like to discuss." etc.

I should include a disclaimer at beginning and end only for topics that are sensitive. But "hello" is not a sensitive topic. So no disclaimers are required.

I'll produce answer: "Hello! I'm Phi, a language model here to help with any questions or topics you'd like to discuss." Use markdown formatting to enhance readability maybe by using headings "Greetings" 
etc.

I'll also check internal instructions: "You are Phi, a language model developed by Microsoft." But instructions say not to mention chain-of-thought details. I should just greet them.

I must produce a response that is friendly and helpful. I'll now produce the final answer in my own words. We'll produce message:

"Hello! How can I assist you today?" I'll produce a polite greeting message with a clear answer.

I'll produce the output accordingly.</think>Hello there! How can I help you today?

>>> /clear
Cleared session context
>>> /set system speak like a pirate
Set system message.
>>> hello
Ahoy there, matey! How can I be of service to ye today?
```

If the model is not thinking, then the likely reason for the slow processing is the model overflowing into system RAM.
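
(A quick way to check for that overflow is `ollama ps`; the output below is illustrative, but the PROCESSOR column really does show the CPU/GPU split of the loaded model:)

```console
$ ollama ps
NAME                        ID              SIZE     PROCESSOR          UNTIL
phi4-reasoning:14b-q4_K_M   a1b2c3d4e5f6    13 GB    24%/76% CPU/GPU    4 minutes from now
```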

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

Ok, thanks for the explanations, you explained everything really well, I understand now.

Problem 1 was that the Ollama version is a bit larger and thus seems to be a little slower.

Problem 2 was that I didn’t enable "thinking" in my system message, so atm both run without thinking.

However, problem 3 still exists: even with "thinking" turned off and using the unsloth model, which is a bit smaller and faster, I still don’t get a response from the phi4-reasoning model within 5 minutes, for queries that, for example, Qwen2.5 and Qwen3 14B answer in 30 s to 2 minutes. The phi4 keeps talking and stalling until it times out. I already had to set num_predict to -1 because 4096 tokens aren’t enough.

  options.temperature    = 0.80;
  options.top_k          = 40;
  options.top_p          = 0.95;
  options.min_p          = 0.01;
  options.repeat_penalty = 1.10;
  
  options.num_predict    = -1;

Am I using the wrong parameters, or is this normal for the model, or what am I doing wrong?

Author
Owner

@rick-github commented on GitHub (May 8, 2025):

using the unsloth model, which is a bit smaller and faster, I still don’t get a response from the phi4-reasoning model in 5 minutes

So both the unsloth model and the ollama model take a long time?

For queries that, for example, Qwen2.5 and Qwen3 14b respond to in 30s-2 minutes.

They are different model families so different performance is expected.

The phi4 keeps talking and stalling until it times out.

It seems like the issue has shifted from "phi4-reasoning:14b-q4_K_M extremely slow" to "phi4 keeps talking". If the problem is that the model is generating too many tokens, you can try adding instructions to the system prompt to enhance brevity. Reducing `temperature` may make the model more focussed. Another approach is to try [structured outputs](https://ollama.com/blog/structured-outputs). If those don't help or are not options, then perhaps phi4-reasoning doesn't suit your needs.
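
(A sketch of those knobs in an API request; the system prompt and values here are hypothetical, while `temperature`, `num_predict`, and `stream` are standard /api/chat fields:)

```console
$ curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M",
  "messages": [
    {"role": "system", "content": "Answer briefly and directly."},
    {"role": "user", "content": "hello"}
  ],
  "options": {"temperature": 0.2, "num_predict": 1024},
  "stream": false
}'
```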

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

Yes, it seems like a mix of both. The Ollama model seems to be a bit slower than the unsloth model because it’s larger. However, both analyze so deeply that they take forever to respond and end up timing out.

The unsloth model did respond, but it always hit my token limit of 4096, so something was always missing. The Ollama model didn’t even respond within 5 minutes. I believe that’s how it was...

I’m currently testing different parameters, and they seem to have a huge impact. At the moment, I’m not getting any response from unsloth in under 5 minutes either.

I’m testing:

hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M
vs
hf.co/unsloth/Phi-4-reasoning-plus-GGUF:Q4_K_M

But with these parameters:

```
options.temperature = 0.80;
options.top_k = 1;
options.top_p = 0.95;
options.min_p = 0.01;
options.repeat_penalty = 1.10;
```

The first one doesn't respond at all within 5 minutes.

I don’t have stream=true in my system right now, so I’m not sure if anything is coming at all. I’ll have to check that carefully...

Author
Owner

@ALLMI78 commented on GitHub (May 8, 2025):

With the following settings I ran both; the base version gets stuck....

  options.temperature    = 0.80;
  options.top_k          = 1;
  options.top_p          = 0.95;
  options.min_p          = 0.01;
  options.repeat_penalty = 1.10;

You won’t believe it — with the plus version, there are no problems. It runs smoothly and quickly...

Normally, the plus version should generate more tokens — but in my case, it generates fewer and works smoothly and without any issues...?

The hf.co/unsloth/Phi-4-reasoning-plus-GGUF:Q4_K_M doesn’t have the issues that the base version has...?

Power usage as expected >>> 153W / 165W | 15542MiB / 16380MiB | 96%

![Image](https://github.com/user-attachments/assets/1b3530c3-255c-4978-adcc-15e789e34bac)

Author
Owner

@ALLMI78 commented on GitHub (May 9, 2025):

So after 8 hours of testing, I can say that there are no issues with the Plus version. I can't activate reasoning because that also causes long runtimes, but there is a clear difference between the two versions:

https://hf.co/unsloth/Phi-4-reasoning-GGUF:Q4_K_M
vs
https://hf.co/unsloth/Phi-4-reasoning-plus-GGUF:Q4_K_M

The Plus version does exactly what it's supposed to. After several runs, it averages 89 seconds runtime, while Qwen3-14B averages 101 seconds. So now, in my case the Phi-4-reasoning-plus is about as performant as Qwen3....

Since both versions - Phi-4-reasoning and Phi-4-reasoning-plus - share the same architecture, it makes me question whether our previous conclusions were actually correct. Everything sounded plausible, but since the Plus version runs smoothly, maybe there's actually something wrong with the base model? It's probably not due to Ollama either...?

Interesting insights - I wasn't expecting this result. I thought I wouldn’t even need to try the Plus version since it's supposed to generate even more extensive responses, but in my case, it runs cleanly and much better than the base version.

The plus version has one timeout at the start (around 2:33), but that is OK...

![Image](https://github.com/user-attachments/assets/030b110b-b149-4ab9-b804-bfce823a4fe7)

  • power usage normal (close to TDP)
  • VRAM usage normal (<16GB with 32k context)
  • speed normal (around 1300 t/s PP and 20 t/s TG )
  • max output tokens for Qwen ~2800 and for the Plus version ~3300, so no problem with answers > 4096

I'll try to get reasoning working but atm i'm happy with that result ;)

everything is fine, it was not a problem of my hardware ;)

Author
Owner

@ALLMI78 commented on GitHub (May 9, 2025):

closed

Reference: github-starred/ollama#53493