[GH-ISSUE #10088] Use HF's unsloth quantized multipart GGUF and maybe safetensors files directly (a directory with 6 files: ...-00001-of-00006.gguf, etc) without requiring merging into a single GGUF file #32373

Closed
opened 2026-04-22 13:34:42 -05:00 by GiteaMirror · 10 comments

Originally created by @vadimkantorov on GitHub (Apr 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10088

Is it possible to directly use the quantized weight files as published by Unsloth on HF? A week ago, Unsloth published new quantized variants of DeepSeek-V3-0324 at
https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF, along with a table of suggested models.

For instance, downloading https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/tree/main/UD-Q2_K_XL with the HF fast downloader would produce a directory such as:

```
/mnt/fs/unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL
/mnt/fs/unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf
/mnt/fs/unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00002-of-00006.gguf
/mnt/fs/unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00003-of-00006.gguf
/mnt/fs/unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00004-of-00006.gguf
/mnt/fs/unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00005-of-00006.gguf
/mnt/fs/unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00006-of-00006.gguf
```

How can one load these already downloaded multi-part gguf files directly from the local filesystem?

For this particular variant, there seems to be a re-published version at https://ollama.com/sunny-g/deepseek-v3-0324:ud-q2_k_xl, but how would one do this for any other quant?

These models are quite large (200 GB), so it is quite important to have a way to avoid downloading them multiple times or storing them in multiple local caches or formats. Also, HF servers and the hf_transfer tool enable faster downloads than https://ollama.com. So it would be nice to work directly with the HF-downloaded files (and maybe even with a prepopulated HF cache dir).

Thanks!

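As an aside on the download step mentioned above: a minimal sketch, assuming `huggingface_hub` is installed with the optional `hf_transfer` extra (the repo and include pattern are the ones from this issue):

```shell
# Minimal sketch, assuming: pip install -U "huggingface_hub[hf_transfer]"
# HF_HUB_ENABLE_HF_TRANSFER=1 switches on the faster Rust-based downloader.
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    unsloth/DeepSeek-V3-0324-GGUF \
    --include "UD-Q2_K_XL/*" \
    --local-dir /mnt/fs/unsloth/DeepSeek-V3-0324-GGUF
```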
GiteaMirror added the feature request label 2026-04-22 13:34:42 -05:00

@rick-github commented on GitHub (Apr 2, 2025):

ollama doesn't currently support sharded models; see #5245. The only way to use them at the moment is to merge them.

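For the merge itself, one hedged sketch (not an ollama feature) is llama.cpp's `gguf-split` tool, whose `--merge` mode reassembles a sharded GGUF; the binary is named `llama-gguf-split` in recent llama.cpp builds:

```shell
# Merge a sharded GGUF into a single file with llama.cpp's gguf-split tool.
# Point it at the first shard; it locates the remaining parts itself.
llama-gguf-split --merge \
    DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
    DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf
```

The merged file can then be imported with `FROM /path/to/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf` in a Modelfile.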

@vadimkantorov commented on GitHub (Apr 2, 2025):

Oh :(

Is the obstacle mainly about creating mmaps onto several files instead of a single mmap onto one file?

Because for really big, multi-hundred-GB models, I guess some form of file chunking will be used anyway, be it multiple safetensors or multiple GGUF files. So having native support for these is crucial, as duplication with merged models can eat disk space and cost hours of re-download time / network bandwidth :(


@vadimkantorov commented on GitHub (Apr 18, 2025):

Same question about multi-part `*.safetensors` models.

@rick-github Is it currently supported without manual merging?

Thanks! :)


@rick-github commented on GitHub (Apr 18, 2025):

ollama supports split safetensor models. The files are merged and then converted into a single GGUF file.


@vadimkantorov commented on GitHub (Apr 18, 2025):

Oh great. So should I do something like `echo "FROM https://huggingface.co/AI-MO/Kimina-Prover-Preview-Distill-7B" > Modelfile && ollama create Kimina-Prover-Preview-Distill-7B` and then `ollama run Kimina-Prover-Preview-Distill-7B`?


@rick-github commented on GitHub (Apr 18, 2025):

`FROM` doesn't work with URLs that don't reference a GGUF or llama.cpp-compatible repo. You need to download the model and then use `FROM /path/to/directory`.

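Put concretely, a sketch of that workflow (the local path and model name are just examples):

```shell
# Download the safetensors repo to a local directory, then import it;
# ollama merges the shards and converts them to a single GGUF on create.
huggingface-cli download AI-MO/Kimina-Prover-Preview-Distill-7B \
    --local-dir ./Kimina-Prover-Preview-Distill-7B
echo "FROM ./Kimina-Prover-Preview-Distill-7B" > Modelfile
ollama create kimina-prover-7b -f Modelfile
ollama run kimina-prover-7b
```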

@rick-github commented on GitHub (Apr 18, 2025):

Note that you will likely have to supply a template. The model is based on qwen2, so using the [template from qwen2](https://ollama.com/library/qwen2/blobs/77c91b422cc9) should provide a starting point.

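For illustration, a Modelfile sketch with a ChatML-style template of the kind qwen2-family models use; this is an approximation, so verify it against the linked qwen2 template blob:

```shell
# Hypothetical Modelfile: base directory plus an approximate qwen2 (ChatML)
# chat template; check the linked blob for the authoritative version.
cat > Modelfile <<'EOF'
FROM ./Kimina-Prover-Preview-Distill-7B
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
EOF
```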

@vadimkantorov commented on GitHub (Apr 18, 2025):

Hmm, I then get `Error: pull model manifest: 400: {"error":"Repository is not GGUF or is not compatible with llama.cpp"}`

The repo contains 4-part safetensors weights and is based on Qwen2 (so it should be llama.cpp-compatible).


@rick-github commented on GitHub (Apr 18, 2025):

https://github.com/ollama/ollama/issues/10088#issuecomment-2815407373


@vadimkantorov commented on GitHub (Apr 18, 2025):

Thank you! It seems to have worked! Should I file a feature request for supporting import of multi-part safetensors directly from a URL, so this doesn't get forgotten? I could not find an existing issue for this.

For large models, here is how I clone them without data duplication:

```shell
# Usage: bash git_clone_lfs_nodup.sh https://huggingface.co/AI-MO/Kimina-Prover-Preview-Distill-7B $USER/Kimina-Prover-Preview-Distill-7B

# Clone without checking out LFS content (the working tree gets pointer stubs).
GIT_LFS_SKIP_SMUDGE=1 git clone $1 $2
cd $2
# Download the LFS objects into .git/lfs/objects.
git lfs fetch
# Move (not copy) each fetched object onto its working-tree path.
git lfs ls-files -l | while read SHA DASH FILEPATH; do mv ".git/lfs/objects/${SHA:0:2}/${SHA:2:2}/$SHA" "$FILEPATH"; done
```
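A note on why this avoids duplication: `GIT_LFS_SKIP_SMUDGE=1` stops git from copying each LFS object out of `.git/lfs/objects` into the working tree during checkout, and the final `mv` relocates the fetched objects into place rather than duplicating them, so the model occupies the disk only once. A quick, illustrative sanity check:

```shell
# Illustrative check: the LFS object store should now be (near) empty, and
# total disk usage should be roughly the model size rather than double it.
du -sh .git/lfs/objects .
```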

Reference: github-starred/ollama#32373