[GH-ISSUE #3357] Run GGUF files directly #2063

Closed
opened 2026-04-12 12:18:06 -05:00 by GiteaMirror · 3 comments

Originally created by @Dampfinchen on GitHub (Mar 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3357

Originally assigned to: @pdevine on GitHub.

What are you trying to do?

Why is the GGUF converted instead of just being run directly, like all the other inference engines do (llama.cpp, KoboldCpp, Oobabooga, LM Studio, etc.)?

How should we solve this?

Let the GGUF file run directly, without conversion. (For reference, the current import path is sketched after this issue body.)

What is the impact of not solving this?

No response

Anything else?

No response
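
For context, importing a local GGUF into Ollama goes through a Modelfile and `ollama create`, which copies the file into Ollama's content-addressed blob store rather than running it in place. A minimal sketch (the `.gguf` path and model name are hypothetical examples):

```
# Modelfile
FROM ./mistral-7b-q4.gguf
```

```
ollama create mymodel -f Modelfile
ollama run mymodel
```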

GiteaMirror added the question label 2026-04-12 12:18:06 -05:00

@pdevine commented on GitHub (Mar 27, 2024):

Hey @Dampfinchen, thanks for the issue. There are a few reasons why we don't run them directly, but the most important one is that the blob format is content-addressable. That means if you want to create another model based upon that same model, Ollama will deduplicate the storage. For now this really helps with different prompts and parameters that you want to try out (i.e. with changes to a Modelfile), but in the future it will also really help with LoRA layers and embeddings where there is a common base model.

I'm going to go ahead and close the issue, but if you've got any other questions around this, feel free to keep commenting.
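
To illustrate the deduplication described above: two Modelfiles built `FROM` the same GGUF share a single weights blob, named by its SHA-256 digest, plus two small manifests. A sketch, with hypothetical file and model names:

```
# Modelfile.a
FROM ./mistral-7b-q4.gguf
PARAMETER temperature 0.2

# Modelfile.b
FROM ./mistral-7b-q4.gguf
SYSTEM You are a terse assistant.
```

```
ollama create model-a -f Modelfile.a
ollama create model-b -f Modelfile.b
# Both manifests reference the same sha256-... blob (under ~/.ollama/models/blobs
# on default installs); the multi-gigabyte weights are stored once.
```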


@MoonRide303 commented on GitHub (May 25, 2024):

@pdevine Maybe it's worth reaching out to the team maintaining the [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) and addressing that in the next version, then? Model files eat up quite a lot of disk space, and having to keep a separate copy per application is a terrible waste - that was the main reason I stopped using Ollama.
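
As a partial workaround (an observation about the storage layout, not an official feature): the model layer Ollama stores is a verbatim GGUF blob, so other engines can usually be pointed at it in place instead of keeping their own copy. A sketch, with a hypothetical model name:

```
# Print the Modelfile of a local model to find its blob path:
ollama show mymodel --modelfile
# -> FROM /home/user/.ollama/models/blobs/sha256-...
# That blob is a plain GGUF file, so e.g. llama.cpp can load it directly:
# ./main -m /home/user/.ollama/models/blobs/sha256-<digest>
```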


@pdevine commented on GitHub (May 30, 2024):

@MoonRide303 I definitely appreciate the feedback. I'm just now starting to work on fine-tunes and on separate adapters that can all share the same base model. That will enormously cut down on disk usage and dramatically reduce download times for fine-tuned models.

Unfortunately, I feel like the GGUF folks (and the community in general) have gone down a path where everyone is OK w/ fusing fine-tunes, wasting huge amounts of disk space and people's bandwidth.
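
The Modelfile's `ADAPTER` instruction is the relevant mechanism here: a fine-tune ships as a small LoRA adapter layered over a base model that resolves to an existing, shared blob. A sketch with hypothetical names:

```
# Modelfile: small LoRA adapter over a shared base model
FROM llama3
ADAPTER ./my-finetune-lora.gguf
```

Because the base resolves to the same blob digest, only the much smaller adapter layer is new data to store and download.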

Reference: github-starred/ollama#2063