[GH-ISSUE #9953] Spread model only for same CUDA version #68573

Closed
opened 2026-05-04 14:28:18 -05:00 by GiteaMirror · 4 comments

Originally created by @galvanoid on GitHub (Mar 23, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9953

My setup is 3x3060s and 1 Tesla M10.
In previous versions of Ollama, when I loaded a model, it was distributed among the 3x3060s. If it didn't fit on them, it used a GPU, but it never distributed the load across all the GPUs.
This allowed me to have the Tesla M10 available to load another independent model or to load the embedding model.
Now, when I load a model, it always uses all available cards, including the Tesla M10, which slows down inference.
For example, if I previously loaded a 32b model, it only used the RTX 3060s, with decent inference speed. Now, if I load a 32b model, it is distributed between the 3060s and the M10, so the inference speed is much slower.
Is there a way to revert (maybe via an environment variable) to the previous behavior, where models are distributed only among cards using the same CUDA version?


@rick-github commented on GitHub (Mar 23, 2025):

ollama doesn't have a mechanism for this at the moment. There's an instance management ticket at #3902, but no progress so far. The only way to do this now would be to run multiple servers, bind the different GPUs with CUDA_VISIBLE_DEVICES, and either have each client select the server it wants to talk to, or run a proxy in front that routes requests based on the contents of the "model" field.
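For illustration, here is a minimal sketch of the proxy idea described above, written in Go (ollama itself is a Go project). It assumes two separate ollama servers have already been started, one pinned to the 3060s and one to the M10 via CUDA_VISIBLE_DEVICES and OLLAMA_HOST; the ports, device indices, and the model-to-backend rule are placeholders, not anything built into ollama.

```go
// proxy.go: forward each request to one of two ollama servers based on the
// "model" field in the JSON request body. The backends are assumed to have
// been started separately, for example (device indices are illustrative):
//   CUDA_VISIBLE_DEVICES=0,1,2 OLLAMA_HOST=127.0.0.1:11435 ollama serve
//   CUDA_VISIBLE_DEVICES=3     OLLAMA_HOST=127.0.0.1:11436 ollama serve
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

var (
	big   = mustParse("http://127.0.0.1:11435") // 3x RTX 3060
	small = mustParse("http://127.0.0.1:11436") // Tesla M10
)

func mustParse(s string) *url.URL {
	u, err := url.Parse(s)
	if err != nil {
		log.Fatal(err)
	}
	return u
}

// pickBackend is an example routing policy: embedding models go to the M10,
// everything else to the 3060s. Adjust to match your own model names.
func pickBackend(model string) *url.URL {
	if strings.Contains(model, "embed") {
		return small
	}
	return big
}

func main() {
	handler := func(w http.ResponseWriter, r *http.Request) {
		// Read the body so we can inspect it, then restore it for forwarding.
		body, _ := io.ReadAll(r.Body)
		r.Body = io.NopCloser(bytes.NewReader(body))

		var req struct {
			Model string `json:"model"`
		}
		_ = json.Unmarshal(body, &req) // non-JSON requests fall through to the default backend

		proxy := httputil.NewSingleHostReverseProxy(pickBackend(req.Model))
		proxy.ServeHTTP(w, r)
	}
	// Listen on ollama's default port so clients need no changes.
	log.Fatal(http.ListenAndServe("127.0.0.1:11434", http.HandlerFunc(handler)))
}
```

Pointing clients at the proxy (port 11434 here) keeps the usual API unchanged; only the backend that serves each request differs.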


@galvanoid commented on GitHub (Mar 23, 2025):

But in versions prior to v0.5.13, it worked like this: if I load a large model, it's split only between the RTX 3060s, not the M10, so I can use the M10 to load other small models.


@rick-github commented on GitHub (Mar 23, 2025):

CUDA v12 enabled old architectures in https://github.com/ollama/ollama/commit/a499390648e9184211a1e9d196cdb20b48355591. There's unfortunately no way to influence this at run time at the moment. If you want a version of ollama that doesn't build for arch 50 for cuda_v12, you can clone the repo, tweak CMAKE_CUDA_ARCHITECTURES, and build.


@galvanoid commented on GitHub (Mar 24, 2025):

> CUDA v12 enabled old architectures in a499390. There's unfortunately currently no way to influence this at run time. If you want a version of ollama that doesn't build for arch 50 for cuda_v12, you can clone the repo, tweak CMAKE_CUDA_ARCHITECTURES, and build.

Thanks, I'll try that.

Reference: github-starred/ollama#68573