[GH-ISSUE #9726] Decouple Model Loading and Inference to Allow Dynamic Thread Configuration without Model Reload #6357

Open
opened 2026-04-12 17:52:07 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @luckycv on GitHub (Mar 13, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9726

Description:

Currently, whenever an inference setting such as numThreads is changed, the system reloads the entire model. This can significantly slow down inference, especially with large models, since reloading the weights is a time-consuming process. The example below shows how this surfaces through the API.
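
For concreteness, here is a minimal Go sketch of how the reload manifests today through Ollama's HTTP API. The `/api/generate` endpoint, the default port 11434, and the `num_thread` option are part of Ollama's documented API; the model name is a placeholder, and the reload between the two calls is the behavior this issue describes.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// generate sends one completion request with a given thread count.
func generate(numThread int) error {
	body, err := json.Marshal(map[string]any{
		"model":  "llama3", // placeholder: any locally pulled model
		"prompt": "Hello",
		"stream": false,
		"options": map[string]any{
			"num_thread": numThread, // inference-time setting
		},
	})
	if err != nil {
		return err
	}
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	out, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	fmt.Println(string(out))
	return nil
}

func main() {
	// First call loads the model with num_thread=4.
	if err := generate(4); err != nil {
		fmt.Println("request failed:", err)
		return
	}
	// Today, changing num_thread forces a full model reload before this
	// call can run, even though the weights themselves are unchanged.
	if err := generate(8); err != nil {
		fmt.Println("request failed:", err)
	}
}
```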

Expected Behavior:

We would like the ability to modify inference-related settings (such as numThreads) dynamically, without reloading the model. This would allow resource usage to be adjusted on the fly, without incurring the reload overhead on every configuration change.

Proposed Solution:

Model Loading: Keep model loading independent of inference configuration. The model should be loaded once; settings that affect only inference (e.g., thread count, batch size) should remain adjustable afterwards.

Dynamic Configuration: Allow inference parameters such as numThreads to be adjusted after the model is loaded, without reloading the weights or rebuilding the model state (a sketch of such a design follows this section).

Use Case: This would benefit users running inference on large models, where adjusting resources such as CPU threads is often necessary for optimal performance. Changing these settings without a reload would greatly improve both performance and user experience.
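
Here is a minimal sketch of the proposed decoupling, in Go. This is not Ollama's actual internal structure; every type and function name below is invented for illustration. The point is only the shape of the design: load-time state lives in one immutable struct, per-request settings in another, and swapping the latter never touches the former.

```go
package main

import (
	"fmt"
	"sync"
)

// LoadedModel holds only state fixed at load time (weights, vocabulary).
type LoadedModel struct {
	Name string
	// weight tensors, tokenizer, etc. omitted
}

// InferenceConfig holds settings that may change between requests.
type InferenceConfig struct {
	NumThreads int
	BatchSize  int
}

// Runner keeps the two apart: the model is loaded once, the config is
// replaced freely under a lock.
type Runner struct {
	mu    sync.RWMutex
	model *LoadedModel
	cfg   InferenceConfig
}

func NewRunner(m *LoadedModel, cfg InferenceConfig) *Runner {
	return &Runner{model: m, cfg: cfg}
}

// SetConfig updates inference parameters without reloading the model.
func (r *Runner) SetConfig(cfg InferenceConfig) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cfg = cfg // r.model is untouched: no reload
}

// Generate snapshots the current config and runs inference with it.
func (r *Runner) Generate(prompt string) string {
	r.mu.RLock()
	cfg := r.cfg
	model := r.model
	r.mu.RUnlock()
	// A real implementation would dispatch cfg.NumThreads workers here.
	return fmt.Sprintf("[%s, threads=%d] %s", model.Name, cfg.NumThreads, prompt)
}

func main() {
	r := NewRunner(&LoadedModel{Name: "llama3"},
		InferenceConfig{NumThreads: 4, BatchSize: 512})
	fmt.Println(r.Generate("hello"))
	r.SetConfig(InferenceConfig{NumThreads: 8, BatchSize: 512}) // no reload
	fmt.Println(r.Generate("hello again"))
}
```

Under this shape, SetConfig is cheap because it only replaces a small struct; the expensive load happens exactly once per model.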

Additional Context:

Model loading and inference are currently tightly coupled, which makes it difficult to adjust parameters like numThreads dynamically. Decoupling the two would improve efficiency and eliminate unnecessary reload overhead.

GiteaMirror added the feature request label 2026-04-12 17:52:07 -05:00