[PR #12186] [MERGED] Hybrid and recurrent memory estimates #13731

Closed
opened 2026-04-13 00:34:17 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/12186
Author: @gabe-l-hart
Created: 9/4/2025
Status: Merged
Merged: 9/8/2025
Merged by: @jessegross

Base: main ← Head: GraniteHybridMemoryEstimates


📝 Commits (8)

  • 653a78f feat: Add a debug log showing the estimated KV sizes from GraphSize
  • b772a0c feat: Add support for parsing HeadCount and HeadCountKV as arrays
  • c81f7a1 feat: Add getters for SSM parameters
  • dddefb4 feat: Add support for recurrent layers in kv size estimate
  • e53a65f fix: Use F32 for recurrent layer size estimates
  • 8184a74 fix: Use headsKVL instead of headsL for attn size estimate
  • 857cc5f fix: Also use the recurrent logic for pure recurrent models which have no nHeads
  • 2f53a55 fix: Remove duplicate log line
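Commit b772a0c above generalizes head-count parsing so `HeadCount`/`HeadCountKV` can be either a scalar (uniform attention models) or a per-layer array (hybrid models, where recurrent layers report zero attention heads). A minimal sketch of that scalar-or-array idea, with invented names rather than the actual ggml.go API:

```go
package main

import "fmt"

// headCountsPerLayer is a hypothetical helper illustrating the parsing
// generalization: a scalar value is broadcast to every layer, while an
// array is taken as already per-layer (recurrent layers carry 0 heads).
func headCountsPerLayer(v any, nLayers int) []uint64 {
	switch t := v.(type) {
	case uint64:
		// Scalar: same head count for all layers.
		out := make([]uint64, nLayers)
		for i := range out {
			out[i] = t
		}
		return out
	case []uint64:
		// Array: one entry per layer.
		return t
	}
	return nil
}

func main() {
	fmt.Println(headCountsPerLayer(uint64(32), 4))           // uniform attention
	fmt.Println(headCountsPerLayer([]uint64{0, 0, 8, 0}, 4)) // hybrid: one attention layer
}
```

A zero entry in the array is what lets the size estimator pick the recurrent path for that layer instead of the KV-cache path.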

📊 Changes

1 file changed (+121 additions, -14 deletions)

📝 fs/ggml/ggml.go (+121 -14)

📄 Description

This PR updates the memory size estimate logic to better handle recurrent and hybrid-recurrent models. These models are currently badly overestimated because the default logic assumes full attention for every layer.

The sizing logic for the recurrent layers comes from the llama.cpp implementation (https://github.com/ggml-org/llama.cpp/blob/master/src/llama-memory-recurrent.cpp#L87):

        ggml_tensor * r = ggml_new_tensor_1d(ctx, type_r, hparams.n_embd_r()*mem_size);
        ggml_tensor * s = ggml_new_tensor_1d(ctx, type_s, hparams.n_embd_s()*mem_size);
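A key property of this scheme is that the recurrent state size is independent of context length, which is why the "After" numbers below grow much more slowly with `num_ctx`. A rough Go sketch of the per-layer split; the model dimensions and layer layout here are invented for illustration, and `nEmbdR`/`nEmbdS` follow llama.cpp's `n_embd_r()`/`n_embd_s()` definitions as I understand them, not the actual ggml.go code:

```go
package main

import "fmt"

const (
	f16Bytes = 2 // KV cache assumed F16 in this sketch
	f32Bytes = 4 // recurrent states estimated as F32 (per this PR)
)

// recurrentBytes estimates the per-sequence state size of one recurrent
// (SSM) layer. Note it does not depend on context length.
func recurrentBytes(dConv, dInner, nGroup, dState uint64) uint64 {
	// r state: rolling convolution window (first column is shifted out)
	nEmbdR := (dConv - 1) * (dInner + 2*nGroup*dState)
	// s state: Mamba-style SSM state
	nEmbdS := dState * dInner
	return f32Bytes * (nEmbdR + nEmbdS)
}

// attentionBytes estimates the KV cache of one attention layer, using the
// per-layer KV head count (headsKVL in this PR) rather than query heads.
func attentionBytes(ctxLen, headsKV, headDim uint64) uint64 {
	return 2 /* K and V */ * f16Bytes * ctxLen * headsKV * headDim
}

func main() {
	ctxLen := uint64(131072)
	var total uint64
	// Hybrid layout: pretend every 8th layer is full attention.
	for layer := 0; layer < 40; layer++ {
		if layer%8 == 7 {
			total += attentionBytes(ctxLen, 8, 128)
		} else {
			total += recurrentBytes(4, 4096, 1, 128)
		}
	}
	fmt.Printf("estimated cache: %.2f GiB\n", float64(total)/(1<<30))
}
```

With the old all-attention assumption, all 40 layers would be charged the KV-cache cost; here only the attention layers scale with context, matching the much flatter growth seen in the testing numbers below.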

Testing

Before

ollama run gabegoodhart/granite4-preview:tiny
# NAME                                  ID              SIZE      PROCESSOR    CONTEXT    UNTIL               
# gabegoodhart/granite4-preview:tiny    2ea87d60356a    8.8 GB    100% GPU     16384      4 minutes from now

>>> /set parameter num_ctx 131072
# NAME                                  ID              SIZE     PROCESSOR    CONTEXT    UNTIL              
# gabegoodhart/granite4-preview:tiny    2ea87d60356a    37 GB    100% GPU     131072     4 minutes from now

After

ollama run gabegoodhart/granite4-preview:tiny
# NAME                                  ID              SIZE      PROCESSOR    CONTEXT    UNTIL              
# gabegoodhart/granite4-preview:tiny    2ea87d60356a    4.9 GB    100% GPU     4096       4 minutes from now

>>> /set parameter num_ctx 131072
# NAME                                  ID              SIZE     PROCESSOR    CONTEXT    UNTIL     
# gabegoodhart/granite4-preview:tiny    2ea87d60356a    8.0 GB    100% GPU     131072     4 minutes from now

>>> /set parameter num_ctx 524288
# NAME                                  ID              SIZE     PROCESSOR    CONTEXT    UNTIL     
# gabegoodhart/granite4-preview:tiny    2ea87d60356a    17 GB    100% GPU     524288     4 minutes from now

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference: github-starred/ollama#13731