[GH-ISSUE #1919] create model, not meeting the performance requirements of the gguf #63141

Closed
opened 2026-05-03 12:16:59 -05:00 by GiteaMirror · 2 comments

Originally created by @quanpinjie on GitHub (Jan 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1919

I converted Baichuan2 to GGUF and created a model, but the resulting performance is poor. Do I need to configure anything else?

Modelfile:

```
FROM ./baichuan2-ggml-model-f16.gguf
```

![image](https://github.com/jmorganca/ollama/assets/2564119/ea70b5b6-9729-4a93-b990-a4ce439e6921)
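For context, building and running a model from a local GGUF file with a Modelfile like the one above is typically done with the Ollama CLI; the model name `baichuan2` below is just an illustrative choice:

```
ollama create baichuan2 -f Modelfile
ollama run baichuan2
```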

@easp commented on GitHub (Jan 11, 2024):

Poor performance compared to what? What hardware are you running on?

It looks like you are using the fp16 version of the model. That will require a lot of VRAM and memory bandwidth. Try a q4_k_m quantization.
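As a rough sketch, the quantization step uses llama.cpp (the same project that produces the GGUF conversion); the binary is named `quantize` in older checkouts and `llama-quantize` in newer releases, and the file names here are illustrative:

```
# Produce a q4_k_m quantization of the f16 GGUF file
./quantize ./baichuan2-ggml-model-f16.gguf ./baichuan2-q4_k_m.gguf Q4_K_M
```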

@pdevine commented on GitHub (Mar 12, 2024):

Hey @quanpinjie, sorry for taking so long to comment on the issue. The reason the performance is so slow is that you need to "quantize" the model. We created a document [here](https://github.com/ollama/ollama/blob/main/docs/import.md#quantize-the-model) which explains the steps you need to take. Ollama uses `q4_0` as the default quantization level, so if you want to see similar performance you should try using that level.

I'm going to go ahead and close out the issue, but please feel free to keep commenting or reopen it if I didn't answer your question.
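Following that advice, a minimal sketch of the updated Modelfile, assuming the f16 file has been quantized to `q4_0` as described in the linked document (the file name is illustrative):

```
FROM ./baichuan2-q4_0.gguf
```

After editing the Modelfile, re-run `ollama create` so the model is rebuilt from the quantized file.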
