[GH-ISSUE #9151] Feature Request: Integrate HiP Attention for Extended Context Length #5954

Closed
opened 2026-04-12 17:18:14 -05:00 by GiteaMirror · 2 comments

Originally created by @MubarakHAlketbi on GitHub (Feb 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9151

Description:

This issue requests the integration of HiP (Hierarchically Pruned) Attention into Ollama to enable significantly extended context lengths for supported models. HiP Attention is a training-free method that reduces the attention cost of Transformer models to sub-quadratic in sequence length, making it possible to handle very long contexts with reduced computational resources.
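
To make the idea concrete, here is a minimal NumPy sketch of the general pattern behind hierarchically pruned attention: score coarse key blocks cheaply, keep only the top-k blocks per query, and run ordinary attention inside the kept blocks. This is a toy illustration, not the paper's algorithm (HiP uses multi-stage hierarchical pruning with custom CUDA kernels), and every name in it is invented for this sketch.

```python
# Toy sketch of block-pruned attention. Each query scores coarse key
# blocks (by block-mean similarity), keeps the top_k blocks, and attends
# only within them, so cost grows with the kept tokens, not the full
# context. HiP's real method is hierarchical and kernel-level; this only
# illustrates the pruning idea.
import numpy as np

def block_pruned_attention(q, k, v, block_size=64, top_k=8):
    """q: (Tq, d); k, v: (Tk, d). Returns (Tq, d)."""
    Tk, d = k.shape
    n_blocks = (Tk + block_size - 1) // block_size
    # Cheap per-block summary: mean of the keys in each block.
    pad = n_blocks * block_size - Tk
    k_pad = np.pad(k, ((0, pad), (0, 0)))
    block_means = k_pad.reshape(n_blocks, block_size, d).mean(axis=1)

    out = np.zeros((q.shape[0], d))
    for i, qi in enumerate(q):
        # Stage 1: rank blocks by similarity of the query to block means.
        block_scores = block_means @ qi
        keep = np.argsort(block_scores)[-top_k:]
        # Stage 2: dense attention, but only over tokens in kept blocks.
        idx = np.concatenate([
            np.arange(b * block_size, min((b + 1) * block_size, Tk))
            for b in sorted(keep)
        ])
        scores = k[idx] @ qi / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ v[idx]
    return out

q = np.random.randn(4, 32)
k = np.random.randn(1024, 32)
v = np.random.randn(1024, 32)
print(block_pruned_attention(q, k, v).shape)  # (4, 32)
```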

Motivation:

Ollama users frequently need to process long documents, codebases, or conversations. Current context length limitations can hinder the effectiveness of LLMs in these scenarios. Integrating HiP Attention would provide the following benefits:

  • Greatly Extended Context Length: HiP Attention has demonstrated the ability to handle context lengths of up to 3 million tokens on a single L40S 48GB GPU (as per the research paper). This is a substantial improvement over typical context limits.
  • Improved Performance: The sub-quadratic cost of HiP Attention translates to faster inference speeds, especially for long contexts. The paper claims an estimated 7.24x speedup.
  • Reduced Memory Requirements: By efficiently managing the Key-Value (KV) cache, HiP Attention can reduce the memory footprint required for long-context processing. This allows for larger contexts on existing hardware.
  • Training-Free Approach: A key advantage of HiP Attention is that it doesn't require model retraining. This makes it easier to integrate and apply to existing models supported by Ollama.
  • SGLang Integration: HiP Attention already has an integration with SGLang (DeepAuto-AI/sglang), so a precedent exists for integrating it into a serving engine.

Proposed Implementation (Suggestions):

  1. Integrate the HiP Attention Library: The core implementation is available at DeepAuto-AI/hip-attention. This repository provides Python bindings and CUDA kernels for implementing HiP Attention. The installation instructions (building from source or using Docker) are provided in the repository.

  2. Model Compatibility: Determine which models within Ollama's supported model set are compatible with HiP Attention. The initial focus could be on models commonly used for long-context tasks.

  3. Configuration Options: Expose configuration options to users, allowing them to do the following (a hypothetical sketch of how these options might surface appears after this list):

    • Enable/disable HiP Attention.
    • Control parameters related to the pruning hierarchy (if applicable and exposed by the HiP Attention library).
    • Potentially manage KV cache offloading (if using the ainl-hip-offload version or future versions with this feature).
  4. Performance Testing: Thoroughly test the integration to ensure performance gains and stability, especially with very long contexts. Compare performance with and without HiP Attention enabled (a rough benchmarking sketch also follows this list).

  5. Licensing Considerations: HiP Attention is currently under the FSL-1.1-MIT license (https://fsl.software/), which permits use other than competing commercial use and converts to the standard MIT license two years after each release. Ensure that Ollama's usage complies with these terms; this is a crucial point to address.
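
To make step 3 concrete, here is a hypothetical sketch of how such options might surface through Ollama's existing per-request options mechanism. The /api/generate endpoint and the num_ctx option are real Ollama API surface; hip_attention, hip_block_size, and hip_kv_offload are invented names for illustration only.

```python
# Hypothetical sketch: toggling HiP Attention via Ollama's per-request
# options. Only /api/generate and "num_ctx" are real; the hip_* option
# names below do not exist and are placeholders for this proposal.
import json
import urllib.request

payload = {
    "model": "llama3.1",
    "prompt": "Summarize the attached report.",
    "options": {
        "num_ctx": 1_000_000,    # real option: context window size
        "hip_attention": True,   # hypothetical: enable HiP Attention
        "hip_block_size": 64,    # hypothetical: pruning granularity
        "hip_kv_offload": True,  # hypothetical: KV cache offloading
    },
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```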

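For step 4, a rough benchmarking sketch along the same lines: measure decode throughput at a long context with and without the hypothetical HiP option. The eval_count and eval_duration fields really are returned by /api/generate; the hip_attention option, again, is not.

```python
# Rough benchmarking harness: decode throughput (tokens/sec) derived
# from eval_count / eval_duration (nanoseconds), which Ollama returns
# in its /api/generate response. "hip_attention" is hypothetical.
import json
import urllib.request

def bench(prompt, extra_options):
    payload = {
        "model": "llama3.1",
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": 131072, **extra_options},
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        r = json.load(resp)
    return r["eval_count"] / (r["eval_duration"] / 1e9)

long_doc = "lorem ipsum " * 20000  # stand-in for a genuinely long input
prompt = long_doc + "\nSummarize the text above in one sentence."
for label, opts in [("baseline", {}), ("hip (hypothetical)", {"hip_attention": True})]:
    print(f"{label}: {bench(prompt, opts):.1f} tok/s")
```
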
Relevant Links:

  • HiP Attention GitHub Repository: https://github.com/DeepAuto-AI/hip-attention
  • ArXiv Paper (latest): https://arxiv.org/abs/2406.09827
  • ICLR 2025 Paper: https://openreview.net/forum?id=PTcMzQgKmn
  • SGLang Integration: https://github.com/DeepAuto-AI/sglang

Conclusion:

Adding support for HiP Attention would be a significant enhancement to Ollama, enabling users to work with much longer contexts more efficiently. This would open up new possibilities for using LLMs in various applications that require processing large amounts of text. I believe this feature would be highly valuable to the Ollama community.

GiteaMirror added the feature request label 2026-04-12 17:18:14 -05:00

@rick-github commented on GitHub (Feb 16, 2025):

Probably more suited as a pitch to llama.cpp (https://github.com/ggml-org/llama.cpp/issues).


@MubarakHAlketbi commented on GitHub (Feb 16, 2025):

> Probably more suited as a pitch to llama.cpp.

discussion: https://github.com/ggml-org/llama.cpp/discussions/11910
