[PR #13259] Multi-file GGUF models loading support #24675

Open
opened 2026-04-19 17:43:55 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13259
Author: @cvrunmin
Created: 11/27/2025
Status: 🔄 Open

Base: main ← Head: feat/split-gguf


📝 Commits (10+)

  • efdd9b7 gguf: add split gguf loading
  • fb7c898 ggml: fix cannot read split info
  • 5f32d76 server: sort baselayers by split.no in create route
  • a12cabb ggml: fix wrong param size of metaggml
  • b2ebfcc server: get correct info from split ggufs while creating models
  • 10dc89f server: sanity check when creating model with split gguf
  • 3e353c9 docs: add infos for split gguf
  • fa9f1ee Merge branch 'main' into feat/split-gguf
  • 469ac5b server: more sanity check when loading split gguf
  • 44e6a78 Merge branch 'main' into feat/split-gguf

📊 Changes

25 files changed (+770 additions, -119 deletions)

View changed files

📝 discover/runner.go (+1 -0)
📝 docs/api.md (+2 -1)
📝 docs/import.mdx (+4 -0)
📝 docs/modelfile.mdx (+10 -2)
📝 fs/ggml/ggml.go (+180 -3)
📝 fs/ggml/gguf.go (+4 -1)
📝 go.mod (+1 -1)
📝 llama/llama.go (+12 -2)
📝 llm/server.go (+79 -26)
📝 ml/backend.go (+4 -4)
📝 ml/backend/ggml/ggml.go (+53 -14)
📝 ml/backend/ggml/ggml_test.go (+1 -1)
📝 model/model.go (+2 -2)
📝 runner/llamarunner/runner.go (+13 -8)
📝 runner/ollamarunner/runner.go (+14 -9)
📝 server/create.go (+97 -0)
📝 server/images.go (+29 -14)
📝 server/routes.go (+4 -4)
📝 server/routes_create_test.go (+233 -0)
📝 server/routes_debug_test.go (+2 -2)

...and 5 more files

📄 Description

Summary

This PR aims to fix #5245, a long-standing obstacle that prevents Ollama users from simply pulling large GGUF-format LLMs from repositories such as Hugging Face, where most model providers publish split GGUF files for users with limited Internet connectivity.

Changes

  1. MetaGGML, a container holding multiple GGMLs, acts like a single GGML model and provides very similar, if not identical, functionality to GGML.
    • GraphSize, SupportsKVCacheType, SupportsFlashAttention, FlashAttention, and KV are supported by both GGML and MetaGGML. Calling these functions on GGML delegates to the MetaGGML version of the same function via a wrapper, for code reuse.
    • Adds MetaGGML.TotalTensorBytes() to report the total tensor byte length across all GGUF shards.
  2. ForeignTensor, a wrapper around Tensor that adds ModelPath and TensorRegionOffset, allows loading tensor weights from the corresponding weight file at the correct offset, without searching for and accessing a GGML instance.
  3. llamarunner and ollamarunner now support reading multiple model binaries. The runners do not check whether the binaries are provided in the correct order: the Ollama server supplies them in order, and users who execute the runners manually must ensure the order is correct.
  4. ollama create can now create a model from multiple GGUF files, given either FROM ./dir/to/ggufs/ or multiple FROM ./file/to/split.gguf lines. Ollama sorts the GGUF shards into the correct order by reading split.no from the GGUF key-value metadata. GGUF files without any split.* keys are always placed at the front of the file list.
  5. Promotes github.com/spf13/pflag from an indirect to a direct dependency.
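
The shard-ordering rule from item 4 can be sketched in a few lines of Go. This is an illustrative sketch, not the PR's actual code: the shard type and sortShards name are invented here, standing in for files whose split.* metadata has already been parsed.

```go
package main

import (
	"fmt"
	"sort"
)

// shard is an illustrative stand-in for a parsed GGUF file: hasSplit
// reports whether the file carries any split.* metadata, and splitNo
// holds the value of the split.no key when present.
type shard struct {
	path     string
	hasSplit bool
	splitNo  int
}

// sortShards orders files the way the PR describes: files without
// split.* metadata come first, then split files ascending by split.no.
func sortShards(shards []shard) {
	sort.SliceStable(shards, func(i, j int) bool {
		if shards[i].hasSplit != shards[j].hasSplit {
			return !shards[i].hasSplit // non-split files go first
		}
		return shards[i].splitNo < shards[j].splitNo
	})
}

func main() {
	shards := []shard{
		{path: "model-00002-of-00003.gguf", hasSplit: true, splitNo: 2},
		{path: "mmproj.gguf", hasSplit: false},
		{path: "model-00001-of-00003.gguf", hasSplit: true, splitNo: 1},
		{path: "model-00003-of-00003.gguf", hasSplit: true, splitNo: 3},
	}
	sortShards(shards)
	for _, s := range shards {
		fmt.Println(s.path)
	}
	// Prints mmproj.gguf first, then the split shards in split.no order.
}
```

sort.SliceStable keeps the original relative order of any files that tie (for example, several files with no split.* keys), which matches the "always at the front" behavior without imposing any extra ordering among them.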

Note: saving split GGUF files is out of scope for this PR.

Helpers Welcome

Since I have limited access to powerful hardware, this PR is not fully tested with large models (I only tested with Qwen3-1.7B after splitting the weights myself). If you are interested and have access to powerful hardware, you are welcome to build and test this PR and comment with your results. Feedback on the naming of MetaGGML and ForeignTensor is also welcome, as I am not fully satisfied with my own naming.
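
As background for testers: llama.cpp's split tooling conventionally names shards like model-00001-of-00003.gguf. A quick filename-level sanity check for that convention can be sketched as below (a hedged sketch, not code from this PR — the PR itself orders shards by the split.no metadata key, not by filename):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// splitNamePattern matches llama.cpp-style shard names such as
// "qwen3-00002-of-00003.gguf", capturing prefix, shard index, and count.
var splitNamePattern = regexp.MustCompile(`^(.+)-(\d{5})-of-(\d{5})\.gguf$`)

// parseSplitName extracts (prefix, index, count) from a shard filename,
// or reports ok=false when the name does not follow the convention.
func parseSplitName(name string) (prefix string, index, count int, ok bool) {
	m := splitNamePattern.FindStringSubmatch(name)
	if m == nil {
		return "", 0, 0, false
	}
	index, _ = strconv.Atoi(m[2])
	count, _ = strconv.Atoi(m[3])
	return m[1], index, count, true
}

func main() {
	prefix, index, count, ok := parseSplitName("qwen3-00002-of-00003.gguf")
	fmt.Println(prefix, index, count, ok) // qwen3 2 3 true

	_, _, _, ok = parseSplitName("single-file.gguf")
	fmt.Println(ok) // false
}
```

A check like this can complement the metadata-based ordering: if the filename count disagrees with the number of shards actually supplied, that is a cheap signal of a missing or mismatched file before any GGUF parsing happens.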


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 17:43:55 -05:00
Reference: github-starred/ollama#24675