[PR #11195] [CLOSED] Granite four (llama.cpp bump 443e7e7+) #60171

Closed
opened 2026-04-29 15:05:25 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/11195
Author: @gabe-l-hart
Created: 6/25/2025
Status: Closed

Base: main ← Head: GraniteFour


📝 Commits (10+)

  • a30ae1f TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch
  • 73d089b feat: Update all patches
  • 2613f5d feat: Sync llama.cpp and ggml
  • 424e05c fix: Update rsync-filter for all moved/new/removed files
  • 414a097 fix: Add files missing from sync
  • 62af160 fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs
  • 85aba51 fix: Add ggml files missing from sync
  • 1cd9352 fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files
  • d0fd9e5 fix: Remove mtmd main cpp files
  • fa54a3c fix: Add missing include in sampling_ext.cpp

📊 Changes

214 files changed (+154880 additions, -41616 deletions)

View changed files

📝 CMakeLists.txt (+1 -0)
📝 Makefile.sync (+2 -2)
📝 llama/build-info.cpp (+1 -1)
📝 llama/llama.cpp/.rsync-filter (+12 -3)
📝 llama/llama.cpp/common/common.cpp (+101 -119)
📝 llama/llama.cpp/common/common.go (+2 -2)
📝 llama/llama.cpp/common/common.h (+43 -23)
📝 llama/llama.cpp/common/json-schema-to-grammar.cpp (+5 -47)
📝 llama/llama.cpp/common/json-schema-to-grammar.h (+4 -4)
📝 llama/llama.cpp/common/sampling.cpp (+7 -8)
📝 llama/llama.cpp/include/llama.h (+174 -198)
📝 llama/llama.cpp/src/llama-arch.cpp (+491 -3)
📝 llama/llama.cpp/src/llama-arch.h (+50 -1)
📝 llama/llama.cpp/src/llama-batch.cpp (+774 -271)
📝 llama/llama.cpp/src/llama-batch.h (+126 -55)
📝 llama/llama.cpp/src/llama-chat.cpp (+88 -9)
📝 llama/llama.cpp/src/llama-chat.h (+4 -0)
📝 llama/llama.cpp/src/llama-context.cpp (+671 -466)
📝 llama/llama.cpp/src/llama-context.h (+58 -28)
📝 llama/llama.cpp/src/llama-cparams.cpp (+4 -0)

...and 80 more files

📄 Description

Draft Status

Update: With https://github.com/ggml-org/llama.cpp/pull/13550 merged, this PR is ready to point back at ggml-org/llama.cpp and come out of draft status

This PR is based on an ongoing set of work in llama.cpp:

  • [x] Hybrid Recurrent Cache: https://github.com/ggml-org/llama.cpp/pull/13979
  • [x] mamba2: https://github.com/ggml-org/llama.cpp/pull/9126
  • [x] Granite Four: https://github.com/ggml-org/llama.cpp/pull/13550

Since this chain is long and the pieces are all being merged over time, this draft PR will allow testing of the other correlated llama.cpp changes in the meantime.

Description

This PR is a draft to add support for IBM's Granite 4.0 architecture (GraniteMoeHybrid). To do so, it bumps llama.cpp rather than adding support directly within the new ollama engine. This is a first step, since support is landing in llama.cpp first; the plan is to add ollama engine support as a follow-up (pending #9966, #10079, and a few other prereq PRs).

Changes

There are two big buckets of changes in this PR: (1) changes coming in with the bump of llama.cpp, and (2) changes in the wrapper code to handle (1).

llama.cpp changes

There were some major refactors in llama.cpp since the last bump:

  • Embeddings logic overhauled, removing the need for 0003-embeddings.patch
  • Complete overhaul of KV Cache class hierarchy and ubatch logic, including removal of defrag API. This caused the removal of 0008-ensure-KV-cache-is-fully-defragmented.patch
    • THIS SHOULD BE CAREFULLY TESTED!
  • libllava and libclip are fully deprecated as public APIs, replaced by mtmd, which theoretically provides equivalent capabilities (and introduces audio 😉). The API is quite different, though: it pushes for mtmd to own the end-to-end inference, formatting the token sequence with the multimodal embeddings and then running the decode on the underlying text model (see the sketch after this list).
    • THIS SHOULD BE CAREFULLY TESTED!
  • All architecture-specific implementations for ggml-cpu have been split out into a new source tree structure under ggml-cpu/arch/<arch name>
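
To make the new ownership model concrete, here is a minimal sketch of the mtmd-style flow from the Go side. Every type and function below is a hypothetical stand-in for the cgo bindings in llama.go, not the real mtmd C API:

```go
// Hypothetical sketch: mtmd owns the full multimodal pass. The prompt text
// carries a placeholder marker wherever an image should appear; mtmd
// tokenizes text and images together into chunks, runs the vision encoder
// for the image chunks, and decodes everything on the underlying text model.
func decodeMultimodal(mtmdCtx *MtmdContext, llamaCtx *Context, prompt string, images [][]byte) error {
	bitmaps := make([]*MtmdBitmap, 0, len(images))
	for _, img := range images {
		bm, err := mtmdBitmapFromBytes(mtmdCtx, img) // decode raw image bytes
		if err != nil {
			return err
		}
		defer bm.Free()
		bitmaps = append(bitmaps, bm)
	}

	// Tokenize prompt + images into an ordered list of text/image chunks;
	// each placeholder in the prompt is matched to one bitmap.
	chunks, err := mtmdTokenize(mtmdCtx, prompt, bitmaps)
	if err != nil {
		return err
	}
	defer chunks.Free()

	// Image chunks go through the encoder, text chunks through the normal
	// decode path on the text model; mtmd drives both.
	return mtmdEvalChunks(mtmdCtx, llamaCtx, chunks)
}
```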

ollama changes

  • To handle the change from llava/clip to mtmd, I've attempted to update llama.go to use mtmd instead of llava and clip (f629c520ad), but this is quite likely to be buggy.
    • There is an outstanding TODO to support non-default parameters in the mtmd initialization: https://github.com/gabe-l-hart/ollama/blob/GraniteFour/llama/llama.go#L466
    • The implementation of NewEmbd (https://github.com/gabe-l-hart/ollama/blob/GraniteFour/llama/llama.go#L482) attempts to create an isolated image embedding by passing the model an input that contains only the default placeholder for the image tokens. This should result in a pass through the image encoder, with the resulting image tokens placed into the placeholder positions (see the first sketch after this list).
  • To handle the new directory layout for ggml-cpu, I added a new build layout that uses preprocessor directives to conditionally #include the right architecture implementations from out of source, based on the GOARCH value (see the second sketch after this list).
    • I have only tested this on my M3 Mac, so I haven't ruled out additional issues in the non-arm implementations, though they do seem to work.
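
A minimal sketch of the NewEmbd idea described above, under the same assumption of hypothetical bindings rather than the real cgo wrappers in llama.go:

```go
// Hypothetical sketch of the NewEmbd trick: the "prompt" is nothing but the
// default image placeholder, so tokenization yields a single image chunk.
// Evaluating that chunk runs only the vision encoder, and the embeddings
// that would fill the placeholder positions come back in isolation, with no
// surrounding text to decode.
func NewEmbd(mtmdCtx *MtmdContext, llamaCtx *Context, image []byte) ([][]float32, error) {
	bm, err := mtmdBitmapFromBytes(mtmdCtx, image)
	if err != nil {
		return nil, err
	}
	defer bm.Free()

	// mtmdDefaultMarker stands in for the default placeholder string.
	chunks, err := mtmdTokenize(mtmdCtx, mtmdDefaultMarker(), []*MtmdBitmap{bm})
	if err != nil {
		return nil, err
	}
	defer chunks.Free()

	// Encode the lone image chunk and return its embeddings.
	return mtmdEncodeChunkEmbd(mtmdCtx, llamaCtx, chunks)
}
```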

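And a sketch of the GOARCH-driven include shim. The macro names and file paths here are illustrative, not the ones used in the PR; cgo's per-GOARCH #cgo directives set a define that the C preprocessor then keys on:

```go
package ggml

// Hypothetical shim selecting the ggml-cpu/arch sources at build time.
// The #cgo lines apply only for the matching GOARCH, and the preprocessor
// conditionals then #include the corresponding out-of-source architecture
// implementations. Paths and macro names are illustrative.

/*
#cgo CFLAGS: -I${SRCDIR}/llama.cpp/ggml/src
#cgo arm64 CFLAGS: -DGGML_GOARCH_ARM64
#cgo amd64 CFLAGS: -DGGML_GOARCH_AMD64

#if defined(GGML_GOARCH_ARM64)
#include "ggml-cpu/arch/arm/quants.c"
#elif defined(GGML_GOARCH_AMD64)
#include "ggml-cpu/arch/x86/quants.c"
#endif
*/
import "C"
```
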
Testing

To test the GraniteMoeHybrid architecture, use https://huggingface.co/ibm-granite/granite-4.0-tiny-preview. This PR does not contain conversion support, so conversion should be done using convert_hf_to_gguf.py from my fork.
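
For example, a hypothetical conversion run (local paths are placeholders; --outfile and --outtype are standard convert_hf_to_gguf.py flags):

```sh
# Grab the fork's GraniteFour branch, which carries the converter support.
git clone -b GraniteFour https://github.com/gabe-l-hart/llama.cpp
# Convert a locally downloaded copy of granite-4.0-tiny-preview to GGUF.
python llama.cpp/convert_hf_to_gguf.py ./granite-4.0-tiny-preview \
    --outfile granite-4.0-tiny-preview-f16.gguf --outtype f16
```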


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 15:05:25 -05:00

Reference: github-starred/ollama#60171