[PR #11635] [MERGED] cuda: leverage JIT for smaller footprint #18868

Closed
opened 2026-04-16 06:50:02 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/11635
Author: @dhiltgen
Created: 8/1/2025
Status: Merged
Merged: 8/13/2025
Merged by: @dhiltgen

Base: main ← Head: cuda_jit


📝 Commits (1)

  • bdb62e3 cuda: leverage JIT for smaller footprint

📊 Changes

1 file changed (+3 additions, -3 deletions)

📝 CMakePresets.json (+3 -3)

📄 Description

Prior to this change, our official binaries contained both the PTX code used for JIT and the cubin binary code for our chosen compute capabilities. This change switches to compiling only the PTX code and relying on JIT at runtime to generate the cubin specific to the user's GPU. The cubins are cached on the user's system, so users should only see a small lag on the very first model load for a given Ollama release. This also adds the first generation of Blackwell GPUs so they aren't reliant on the Hopper PTX.
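As a rough illustration of the PTX-only approach described above: CMake lets each entry in `CMAKE_CUDA_ARCHITECTURES` carry a `-virtual` suffix (emit PTX only) or a `-real` suffix (emit a cubin only). The sketch below formats a list of compute capabilities as a PTX-only value. The helper name and the capability list are hypothetical and are not taken from this PR's `CMakePresets.json` diff.

```python
def ptx_only_architectures(compute_capabilities):
    """Format compute capabilities as a PTX-only CMAKE_CUDA_ARCHITECTURES value.

    Hypothetical helper for illustration; not part of the Ollama build.
    """
    entries = []
    for cc in compute_capabilities:
        # "8.6" -> "86": CMake expects the capability without the dot.
        digits = cc.replace(".", "")
        # The "-virtual" suffix asks CMake to embed PTX only, leaving
        # cubin generation to the driver's JIT at runtime.
        entries.append(f"{digits}-virtual")
    return ";".join(entries)


# Assumed example list; the real list lives in CMakePresets.json.
print(ptx_only_architectures(["7.5", "8.6", "8.9"]))
```

The resulting string is the kind of value that could be placed in a preset's cache variables; the actual architectures chosen by the PR are whatever the merged `CMakePresets.json` contains.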

This change reduces ggml-cuda.dll from 1.2G to 460M.

I also removed CC 8.7, as that appears to be a Jetson-only CC that is unused on x86.

Testing on a dual 4060 Windows system, loading gpt-oss:20b takes 6.01s before this change. With this change, on the very first load, it takes 6.26s, then all subsequent loads take ~6.0s. Token rate is unaffected.
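To put the reported figures in relative terms, here is a small sketch that computes the percentage change in DLL size and in first-load time. It assumes 1.2G is roughly 1200M (the exact unit convention is not stated in the description), and the helper name is hypothetical.

```python
def percent_change(before, after):
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Sizes in MB; treating 1.2G as 1200M is an assumption.
size_change = percent_change(1200.0, 460.0)
# Load times in seconds, as reported for the dual 4060 Windows system.
first_load_change = percent_change(6.01, 6.26)

print(f"ggml-cuda.dll size change: {size_change:.1f}%")
print(f"First-load time change: {first_load_change:.1f}%")
```

Under these assumptions the library shrinks by a little over 60%, while the one-time JIT cost adds about 4% to the first load.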


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 06:50:02 -05:00
Reference: github-starred/ollama#18868