[GH-ISSUE #1827] Massive slowdown on v 0.1.18 vs 0.1.17 with same model on Intel Mac #1041

Closed
opened 2026-04-12 10:46:33 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @pjv on GitHub (Jan 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1827

[Screenshot: SCR-20240106-kfri]

I don’t have exact timings but the same model (in this case, `deepseek-coder:6.7b-instruct-q4_K_S`) generates tokens roughly 5 times faster on 0.1.17 than on 0.1.18 on my Intel Mac.

I upgraded to 0.1.18 and noticed the slowdown in token generation and then downgraded back to 0.1.17 and immediately saw the faster throughput I am accustomed to.
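For anyone who wants to quantify the difference rather than eyeball it: Ollama's `/api/generate` response includes an `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), so throughput is a one-line computation. A minimal sketch, with illustrative numbers (not measurements from this issue):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput from Ollama's generate metrics.

    eval_count       -- tokens generated, from the /api/generate response
    eval_duration_ns -- generation time in nanoseconds, same response
    """
    return eval_count / (eval_duration_ns / 1e9)

# Illustrative: 128 tokens in 16 s vs the same 128 tokens in 80 s
fast = tokens_per_second(128, 16_000_000_000)  # 8.0 tok/s
slow = tokens_per_second(128, 80_000_000_000)  # 1.6 tok/s
print(f"{fast:.1f} tok/s vs {slow:.1f} tok/s ({fast / slow:.0f}x slowdown)")
```

Running `ollama run <model> --verbose` should print the same statistics after each response, which makes before/after comparisons across versions straightforward.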

GiteaMirror added the bug label 2026-04-12 10:46:33 -05:00

@jmorganca commented on GitHub (Jan 6, 2024):

Sorry you hit this slowdown. Would it be possible to share the logs? They should be in `~/.ollama/logs/server.log` - thanks so much!


@jmorganca commented on GitHub (Jan 6, 2024):

Also would it be possible to test `llama2` and see if you see the same slowdown with that model architecture? Thanks!


@jmorganca commented on GitHub (Jan 6, 2024):

Ok! Update: I'm able to reproduce this for models with k-quants (e.g. `q4_K_S`), but not for regular quantization (e.g. `q4_0`). Will look into this!


@pjv commented on GitHub (Jan 6, 2024):

Wow, you’re a lot faster than me. I’m still generating logs for you. Do you still need them?


@pjv commented on GitHub (Jan 6, 2024):

> Ok! Update: I'm able to reproduce this for models with k-quants (e.g. `q4_K_S`), but not for regular quantization (e.g. `q4_0`). Will look into this!

Yup, testing the `llama2` model, 0.1.18 seems a bit faster than 0.1.17, but the `q4_K_S` model is very slow.


@jmorganca commented on GitHub (Jan 6, 2024):

No worries about the logs – I can reproduce on my side. Tracking this down.


@coder543 commented on GitHub (Jan 6, 2024):

Semi-related, but isn't k-quant the newer/better quantization method? I have found it confusing that ollama defaults to the non-K quants, but maybe I'm confused about which method is better.


@oldgithubman commented on GitHub (Apr 15, 2024):

> Semi-related, but isn't k-quant the newer/better quantization method? I have found it confusing that ollama defaults to the non-K quants, but maybe I'm confused about which method is better.

Would also appreciate some insight into this


Reference: github-starred/ollama#1041