[PR #11635] cuda: leverage JIT for smaller footprint #13597

Closed
opened 2026-04-13 00:30:54 -05:00 by GiteaMirror · 0 comments
Owner

Original Pull Request: https://github.com/ollama/ollama/pull/11635

State: closed
Merged: Yes


Prior to this change, our official binaries contained both JIT PTX code and the cubin binary code for our chosen compute capabilities. This change switches to compiling only the PTX code and relying on JIT compilation at runtime to generate the cubin specific to the user's GPU. The cubins are cached on the user's system, so users should only see a small lag on the very first model load for a given Ollama release. This also adds the first generation of Blackwell GPUs so they aren't reliant on the Hopper PTX.
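To illustrate what "PTX only" can look like at the build-configuration level, here is a minimal sketch. It assumes a CMake-based CUDA build, where an entry in `CMAKE_CUDA_ARCHITECTURES` with the `-virtual` suffix requests PTX without a matching cubin. The helper name and the list of compute capabilities are hypothetical and are not the list used in this PR.

```python
def ptx_only_architectures(compute_capabilities):
    """Build a CMAKE_CUDA_ARCHITECTURES value that requests PTX only.

    Each compute capability such as "8.0" becomes "80-virtual", which asks
    CMake to emit PTX for that architecture and no precompiled cubin.
    """
    entries = []
    for cc in compute_capabilities:
        major, minor = cc.split(".")
        entries.append(f"{major}{minor}-virtual")
    # CMake lists are semicolon-separated.
    return ";".join(entries)


# Hypothetical example list, NOT the compute capabilities chosen in the PR.
example_ccs = ["7.5", "8.0", "9.0"]
print(ptx_only_architectures(example_ccs))  # prints 75-virtual;80-virtual;90-virtual
```

The resulting string could be passed as `-DCMAKE_CUDA_ARCHITECTURES=...` when configuring. At runtime the CUDA driver compiles the embedded PTX for the GPU that is actually present and caches the result, which is the behavior the description above relies on.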

This change reduces ggml-cuda.dll from 1.2G to 460M.

I also removed CC 8.7, as that appears to be a Jetson-only compute capability that is unused on x86.

Testing on a dual 4060 Windows system, loading gpt-oss:20b takes 6.01s before this change. With this change, on the very first load, it takes 6.26s, then all subsequent loads take ~6.0s. Token rate is unaffected.
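As a quick sanity check on the figures reported above, the relative changes can be worked out as follows. This sketch assumes 1.2G means 1200M (decimal units); the function name is illustrative only.

```python
def relative_change(before, after):
    """Return the fractional change from `before` to `after`."""
    return (after - before) / before


# Load times in seconds, taken from the measurements in the description.
first_load_overhead = relative_change(6.01, 6.26)

# Library size in megabytes, assuming 1.2G == 1200M.
size_change = relative_change(1200, 460)

print(f"first-load overhead: {first_load_overhead:+.1%}")  # prints +4.2%
print(f"library size change: {size_change:+.1%}")          # prints -61.7%
```

Under these assumptions, the one-time JIT cost is roughly a quarter of a second (about 4% of the load time), in exchange for a library that is a little over 60% smaller.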

GiteaMirror added the pull-request label 2026-04-13 00:30:55 -05:00
Reference: github-starred/ollama#13597