[GH-ISSUE #1404] Add support for llamafile #746

Closed
opened 2026-04-12 10:25:33 -05:00 by GiteaMirror · 3 comments

Originally created by @rupurt on GitHub (Dec 6, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1404

Mozilla has announced a new file format, similar to the Modelfile but compiled into a single executable. Are there any plans to support it?

https://github.com/Mozilla-Ocho/llamafile


@pdevine commented on GitHub (Dec 6, 2023):

No plans to do this right now. The issue with bundling everything into a single executable is that if any part of the model needs to be updated (such as the system prompt or the template), you have to download all of the weights again. A lot of models also share those weights (e.g., models based on llama2, mistral, etc.), which makes a pull really fast because you don't have to download the weights again.
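
For illustration, here is a minimal sketch of the layering this comment describes (the model name, prompt, and tag below are hypothetical, not from the thread): in Ollama, the system prompt, template, and parameters are stored as small layers separate from the weights layer referenced by `FROM`.

```sh
# Hypothetical sketch: write a Modelfile whose prompt/parameters live
# in layers separate from the weights referenced by FROM.
cat > Modelfile <<'EOF'
FROM llama2
SYSTEM """You are a concise assistant."""
PARAMETER temperature 0.7
EOF

# Re-running create after editing SYSTEM or PARAMETER reuses the
# existing weights layer; only the small changed layers are rebuilt.
ollama create my-llama2 -f Modelfile
```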


@heresandyboy commented on GitHub (Jul 16, 2024):

Having the weights embedded in the llamafile isn't mandatory, so hopefully that shouldn't be a showstopper. I imagine we are all interested in the potentially huge speedup in CPU inference over anything else.

https://github.com/Mozilla-Ocho/llamafile#:~:text=The%20weights%20for%20an,executable%20file%20size%20limit.

> The weights for an LLM can be embedded within the llamafile. We added support for PKZIP to the GGML library. This lets uncompressed weights be mapped directly into memory, similar to a self-extracting archive. It enables quantized weights distributed online to be prefixed with a compatible version of the llama.cpp software, thereby ensuring its originally observed behaviors can be reproduced indefinitely.
>
> Finally, with the tools included in this project you can create your own llamafiles, using any compatible model weights you want. You can then distribute these llamafiles to other people, who can easily make use of them regardless of what kind of computer they have.
>
> Using llamafile with external weights
>
> Even though our example llamafiles have the weights built-in, you don't have to use llamafile that way. Instead, you can download just the llamafile software (without any weights included) from our releases page. You can then use it alongside any external weights you may have on hand. External weights are particularly useful for Windows users because they enable you to work around Windows' 4GB executable file size limit.
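
As a rough sketch of the external-weights workflow quoted above (the GGUF file name is hypothetical; the `-m`/`-p` flags follow llama.cpp's CLI, which llamafile inherits):

```sh
# Hypothetical sketch: run the bare llamafile binary (downloaded from
# the releases page, no embedded weights) against an external GGUF file.
chmod +x ./llamafile
./llamafile -m mistral-7b-instruct.Q4_K_M.gguf -p "Why is the sky blue?"
```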


@BradKML commented on GitHub (Jul 24, 2024):

I think the recent presentation on the "Holy Grail" makes this more important as a means to speed up CPU inference, but at the same time I'm not sure whether this is an Ollama issue or a llama.cpp issue. @pdevine, could llamafile be treated as a "driver" of sorts?
