[GH-ISSUE #2830] Seeking Information on the Origin of ollama Models #48232

Closed
opened 2026-04-28 07:17:01 -05:00 by GiteaMirror · 9 comments

Originally created by @aaronyy9 on GitHub (Feb 29, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2830

Who can tell me where the models in ollama are downloaded from? For example, is gemma:7b-instruct-fp16, as in `ollama run gemma:7b-instruct-fp16`, sourced from Hugging Face? If so, what is the specific source? Or are the large models in ollama all newly quantized or fine-tuned by the Ollama team themselves?

@pdevine commented on GitHub (Mar 1, 2024):

The sources are typically covered on the model page (e.g. [for gemma](https://ollama.com/library/gemma)). We usually take the safetensors implementation from HF, convert it to GGUF, and then pull the GGUF file into the Ollama model format. You can find out more information about how the import works [here](https://github.com/ollama/ollama/blob/main/docs/import.md).

There are some changes coming which will allow you to use a Modelfile and directly import the safetensors directory on the `FROM` line.
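
For concreteness, here is a minimal sketch of the GGUF import flow described in those docs (the file name `gemma-7b-it.gguf` and the model name `my-gemma` are placeholders, not from the thread):

```bash
# Build an Ollama model from a local GGUF file (names are hypothetical).
cat > Modelfile <<'EOF'
FROM ./gemma-7b-it.gguf
EOF

ollama create my-gemma -f Modelfile   # package the GGUF into Ollama's model format
ollama run my-gemma                   # chat with the imported model
```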

@aaronyy9 commented on GitHub (Mar 1, 2024):

> The sources are typically covered on the model page (e.g. [for gemma](https://ollama.com/library/gemma)). We usually take the safetensors implementation from HF, convert it to GGUF, and then pull the GGUF file into the Ollama model format. You can find out more information about how the import works [here](https://github.com/ollama/ollama/blob/main/docs/import.md).
>
> There are some changes coming which will allow you to use a Modelfile and directly import the safetensors directory on the `FROM` line.

Thank you very much for your help. However, I am still confused about the model's GPU usage. For example, gemma:7b-instruct is shown as 5.2GB in ollama, but there are four safetensors files on Hugging Face totaling about 18GB. That's why I'm guessing ollama has applied some kind of quantization to the models, like 8-bit or 16-bit.
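
(As a rough back-of-the-envelope check, with approximate numbers that are not from the thread: Gemma 7B has on the order of 8.5B parameters, so 16-bit weights take about 8.5B × 2 bytes ≈ 17GB, in line with the safetensors total, while a 4-bit quantization at roughly 0.5–0.6 bytes per weight lands near 5GB.)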

@pdevine commented on GitHub (Mar 1, 2024):

here's the size of the model:
*(screenshot showing the model size)*

I'm not sure why it would say it's only using 5.2GB. The safetensors version is in brainfloat16, and when it's converted to gguf it will be mostly fp16 and fp32 (for the normalization layers). It's definitely not quantized.
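
One way to verify the tensor types is to dump the GGUF file directly (a sketch, assuming the `gguf` Python package from llama.cpp is installed; the file name below is a placeholder for your converted model):

```bash
pip install gguf                              # provides the gguf-dump tool
gguf-dump ./gemma-7b-it-f16.gguf | head -40   # per-tensor entries should list F16/F32, not Q4_0 etc.
```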

@aaronyy9 commented on GitHub (Mar 1, 2024):

brainfloat16

> here's the size of the model:
> *(screenshot showing the model size)*
>
> I'm not sure why it would say it's only using 5.2GB. The safetensors version is in brainfloat16, and when it's converted to gguf it will be mostly fp16 and fp32 (for the normalization layers). It's definitely not quantized.

The 7b-instruct I saw on ollama is this, showing 5.2G, so I'm quite confused.

*(screenshot of gemma 7b-instruct on ollama.com showing 5.2GB)*

@pdevine commented on GitHub (Mar 1, 2024):

Yeah, that is a 4-bit quantized version. Make sure you `ollama pull gemma:7b-instruct-fp16` to get the non-quantized version.
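
Put differently (sizes are approximate):

```bash
ollama pull gemma:7b-instruct        # default tag: 4-bit quantized, ~5.2GB
ollama pull gemma:7b-instruct-fp16   # unquantized fp16, roughly 3x larger
ollama list                          # compare the sizes of the two tags
```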

@aaronyy9 commented on GitHub (Mar 1, 2024):

> Yeah, that is a 4-bit quantized version. Make sure you `ollama pull gemma:7b-instruct-fp16` to get the non-quantized version.

Thank you. Then where was this quantized version of the model downloaded from? From the logs it seems to have come from Hugging Face, but I couldn't find a matching resource there. Could you share the URL?

@pdevine commented on GitHub (Mar 1, 2024):

> We usually take the safetensors implementation from HF, convert it to GGUF, and then pull the GGUF file into the Ollama model format

It's the same safetensors file. It's first converted from safetensors to GGUF (in fp16) and then quantized to the various versions that you can see in the tag list. If you follow the link I mentioned before you can find instructions on how to quantize a model.
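
As a rough sketch of that convert-then-quantize pipeline using llama.cpp's tooling (script, binary, and path names reflect early-2024 llama.cpp and are placeholders that may differ in your checkout):

```bash
# 1. Convert the HF safetensors directory to an fp16 GGUF.
python convert-hf-to-gguf.py /path/to/gemma-7b-it --outtype f16 --outfile gemma-7b-it-f16.gguf

# 2. Quantize the fp16 GGUF down to 4-bit.
./quantize gemma-7b-it-f16.gguf gemma-7b-it-q4_0.gguf q4_0

# 3. Import the quantized file into Ollama via a Modelfile.
echo 'FROM ./gemma-7b-it-q4_0.gguf' > Modelfile
ollama create gemma-q4 -f Modelfile
```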

@aaronyy9 commented on GitHub (Mar 1, 2024):

> > We usually take the safetensors implementation from HF, convert it to GGUF, and then pull the GGUF file into the Ollama model format
>
> It's the same safetensors file. It's first converted from safetensors to GGUF (in fp16) and then quantized to the various versions that you can see in the tag list. If you follow the link I mentioned before you can find instructions on how to quantize a model.

Okay, thank you again for your response.

@michelle-chou25 commented on GitHub (Apr 23, 2024):

Can I use the original models without quantization?

Reference: github-starred/ollama#48232