[PR #12552] [MERGED] Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes #24411

Closed
opened 2026-04-19 17:33:40 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/12552
Author: @gabe-l-hart
Created: 10/9/2025
Status: Merged
Merged: 10/13/2025
Merged by: @mxyng

Base: main ← Head: LlamaCPPBump-GraniteDocling


📝 Commits (10+)

  • e48637f feat: Bump llama.cpp to df1b612
  • 78010b7 fix(mtmd): Correctly encode text chunks during mtmd tokenization
  • 3eeca50 tests: Use MtmdChunk in image_test
  • 58d19c1 Merge remote-tracking branch 'origin/main' into LlamaCPPBump-GraniteDocling
  • e99ba3f style: Fix unnecessary conversion linting
  • 0d45bc8 fix(ggml): Revert changes to ggml_hip.cpp
  • 90a3cb6 fix: Revert changes in mem_nvml.cpp
  • dee012f feat: Update sync point to 1deee0
  • c94ebfc feat: Update patches for 1deee0
  • 6bb2cfc feat: sync for bump to 1deee0

📊 Changes

99 files changed (+4567 additions, -2310 deletions)

View changed files

📝 Makefile.sync (+1 -1)
📝 llama/build-info.cpp (+1 -1)
📝 llama/llama.cpp/common/common.cpp (+1 -0)
📝 llama/llama.cpp/common/common.h (+5 -3)
📝 llama/llama.cpp/include/llama.h (+8 -0)
📝 llama/llama.cpp/src/llama-arch.cpp (+62 -0)
📝 llama/llama.cpp/src/llama-arch.h (+15 -0)
📝 llama/llama.cpp/src/llama-chat.cpp (+1 -1)
📝 llama/llama.cpp/src/llama-context.cpp (+6 -0)
📝 llama/llama.cpp/src/llama-graph.cpp (+17 -0)
📝 llama/llama.cpp/src/llama-graph.h (+8 -0)
📝 llama/llama.cpp/src/llama-hparams.cpp (+5 -1)
📝 llama/llama.cpp/src/llama-hparams.h (+13 -1)
📝 llama/llama.cpp/src/llama-kv-cache-iswa.cpp (+2 -2)
📝 llama/llama.cpp/src/llama-kv-cache.cpp (+2 -5)
📝 llama/llama.cpp/src/llama-memory-hybrid.cpp (+11 -9)
📝 llama/llama.cpp/src/llama-memory-recurrent.cpp (+11 -3)
📝 llama/llama.cpp/src/llama-model-loader.cpp (+1 -0)
📝 llama/llama.cpp/src/llama-model.cpp (+334 -41)
📝 llama/llama.cpp/src/llama-model.h (+13 -0)

...and 79 more files

📄 Description

Description

This PR bumps llama.cpp to df1b612 and fixes the logic for translating multimodal image tokenization into tokens and embeddings when running through VLMs. Key changes coming in with this PR:

  • Improved support for Idefics3 models (SmolVLM, GraniteDocling): images are now properly tiled and delimiter tokens injected
  • Fixed multimodal tokenization to include all text chunks as text tokens, including those interspersed with image tiles
  • Performance improvements for many ggml operations, especially SSM_SCAN on Metal, bringing Granite 4 hybrid models closer to parity with the non-hybrid Granite 4 models
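The tokenization fix above can be illustrated with a minimal sketch (not the actual ollama/llama.cpp code): the key point is that every text chunk in the mixed mtmd chunk stream is emitted as text tokens, including the delimiter text interspersed between image tiles. `process_chunks`, `tokenize`, and `encode_image` below are hypothetical stand-ins for illustration.

```python
def process_chunks(chunks, tokenize, encode_image):
    """Flatten a mixed sequence of ("text", ...) / ("image", ...) chunks
    into a single ordered list of model inputs."""
    inputs = []
    for kind, payload in chunks:
        if kind == "text":
            # The fix: text chunks are ALWAYS tokenized, even when they
            # sit between image tiles (previously such chunks could be
            # dropped, corrupting the prompt seen by the model).
            inputs.extend(("token", t) for t in tokenize(payload))
        elif kind == "image":
            # Image chunks go through the vision encoder and arrive as
            # embeddings rather than vocabulary tokens.
            inputs.extend(("embedding", e) for e in encode_image(payload))
    return inputs

# Example: a tile delimiter token sits between two image tiles and must
# survive into the flattened input sequence.
out = process_chunks(
    [("text", "<tile_sep>"), ("image", "tile0"), ("text", "<tile_sep>")],
    tokenize=lambda s: [s],
    encode_image=lambda img: [img],
)
```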

Testing

Tested ibm-granite/granite-docling-258M:

Convert to GGUF

```sh
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-docling-258M
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-docling-258M --mmproj
```

Import to Ollama

Modelfile

```
FROM ~/models/ibm-granite/granite-docling-258M/granite-docling-258M-F16.gguf
FROM ~/models/ibm-granite/mmproj-granite-docling-258M

TEMPLATE """{{- range $message := .Messages -}}
    {{- "<|start_of_role|>" }}{{ $message.Role }}{{ "<|end_of_role|>" -}}
    {{- $message.Content -}}
    {{- "<|end_of_text|>\n" -}}
{{- end -}}
{{- "<|start_of_role|>assistant<|end_of_role|>" -}}"""

PARAMETER temperature 0.0
```

```sh
ollama create granite-docling:258M
```
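To make the TEMPLATE concrete, here is a plain-Python approximation of the prompt string it renders for one user turn. This is an illustrative sketch only; the real rendering is done by Ollama's Go text/template engine, and `render_prompt` is a hypothetical helper.

```python
def render_prompt(messages):
    """Approximate the Modelfile TEMPLATE: wrap each message in
    role markers, then open an assistant turn for generation."""
    out = ""
    for m in messages:
        out += "<|start_of_role|>" + m["role"] + "<|end_of_role|>"
        out += m["content"]
        out += "<|end_of_text|>\n"
    # Trailing generation prompt for the assistant's reply.
    out += "<|start_of_role|>assistant<|end_of_role|>"
    return out

prompt = render_prompt(
    [{"role": "user", "content": "Convert this image to markdown."}]
)
```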

Results on main

[screenshot: sample-text-screenshot]
```sh
ollama run granite-docling:258M "/Users/ghart/Pictures/sample-text-screenshot.png Convert this image to markdown."

Added image '/Users/ghart/Pictures/sample-text-screenshot.png'
<doctag><section_header_level_1><loc_10><loc_202><loc_300><loc_223>2.1.4.3.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20<doctag><section_header_level_1><loc_10><loc_202><loc_300><loc_223>2.1.4.3.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.57.58.59.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.57.58.59.60.61.62.63.64.65.66.67.68.69.70.71.72.73.74.75.76.77.78.79.80.81.82.83.84.85.86.87.88.89.90.91.92.93.94.95.96.97.98.90.61.62.63.64.65.66.67.68.69.70.71.72.73.74.75.76.77.78.79.80.81.82.83.84.85.86.87.88.89.90.91.92.93.94.95.96.97.98.99.100.101.102.103.104.105.106.107.108.109.110.111.112.113.114.115.116.117.118.119.120.121.122.123.124.125.126.127.128..100.101.102.103.104.105.106.107.108.109.110.111.112.113.114.115.116.117.118.119.120.121.122.123.124.125.126.127.128.129.130.131.132.133.134.135.136.137.138.139.140.141.142.143.144.145.146.147.148^C
```

Results with this PR

```sh
OLLAMA_HOST=http://localhost:22434 ./ollama run granite-docling:258M "/Users/ghart/Pictures/sample-text-screenshot.png Convert this image to markdown."

Added image '/Users/ghart/Pictures/sample-text-screenshot.png'
<doctag><section_header_level_1><loc_10><loc_28><loc_324><loc_130>Model: GraniteDocling 
#16112</section_header_level_1>
<text><loc_18><loc_194><loc_56><loc_243>1 : Draft</text>
<text><loc_38><loc_389><loc_101><loc_427>Conversation 11</text>
<text><loc_148><loc_393><loc_204><loc_428>3 Commits</text>
<text><loc_248><loc_392><loc_298><loc_429>4 Checks</text>
<text><loc_350><loc_388><loc_446><loc_427>Files changed 4</text>
<picture><loc_349><loc_384><loc_449><loc_428><logo></picture>
</doctag>
```


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 17:33:40 -05:00

Reference: github-starred/ollama#24411