[GH-ISSUE #14684] Feature Request: Hot-swappable model loading without server restart #56016

Closed
opened 2026-04-29 10:08:46 -05:00 by GiteaMirror · 1 comment

Originally created by @fuleinist on GitHub (Mar 7, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14684

Problem

Currently, switching between different LLM models in Ollama requires stopping and restarting the Ollama server. This is problematic because:

  1. Time-consuming: Each model switch requires waiting for the server to fully restart
  2. Memory inefficient: All loaded models are cleared from VRAM on restart
  3. Disruptive: Active requests/interactions are interrupted during restart
  4. Poor DX: Developers working with multiple models (e.g., coding assistant + chat) must constantly restart

Proposed Solution

Implement hot-swappable model loading that allows:

  1. Multiple models loaded simultaneously: Keep several models in memory (within VRAM limits)
  2. Dynamic model switching: Switch the active model via API without restarting
  3. Model lifecycle management: Load/unload specific models on demand
  4. Memory-aware scheduling: Automatically manage VRAM based on available resources

Use Cases

  • Multi-model workflows: Use a coding model for code assistance and a chat model for conversation without restart
  • Development pipelines: Quickly switch between different model versions during testing
  • Resource optimization: Keep a lightweight model always loaded while loading heavier models on demand
  • Reduced latency: Eliminate cold-start delays when switching between frequently used models

Implementation Suggestions

API Endpoints:

  • Load a model into memory
  • Unload a model from memory
  • List currently loaded models
  • Add a parameter to the chat/generate request that auto-loads the model if it is not already loaded (see the sketch below)
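
A rough sketch of what such an auto-load flag could look like on a chat request; the "auto_load" field is hypothetical and not an existing Ollama option:

# Hypothetical: load codellama on demand if it is not already resident,
# instead of requiring a separate load call first
curl http://localhost:11434/api/chat -d '{"model": "codellama", "auto_load": true, "messages": [{"role": "user", "content": "Write a binary search"}]}'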

Configuration:

  • Maximum concurrent models (default: 2; see the example below)
  • Max VRAM per model or total
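
For reference, Ollama already exposes a server-wide limit through the OLLAMA_MAX_LOADED_MODELS environment variable; the per-model VRAM cap below is purely illustrative and does not exist today:

# Existing: cap how many models the server keeps in memory at once
OLLAMA_MAX_LOADED_MODELS=2 ollama serve

# Hypothetical: cap VRAM per loaded model (illustrative variable name only)
OLLAMA_MAX_VRAM_PER_MODEL=8GiB ollama serve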

Memory Management:

  • Implement LRU eviction when VRAM is exhausted
  • Add priority system for sticky models that shouldn't be evicted
  • Expose memory usage stats via the API (see the note below)
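
As a starting point, the existing /api/ps endpoint and the ollama ps command already report each loaded model's size and VRAM residency, which the proposed stats could extend:

# List loaded models with their memory footprint and expiry time
curl http://localhost:11434/api/ps
ollama ps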

Example Workflow

# Pre-load multiple models
curl -X POST http://localhost:11434/api/model/load -d '{"model": "codellama"}'
curl -X POST http://localhost:11434/api/model/load -d '{"model": "llama3"}'

# List loaded models
curl http://localhost:11434/api/models/loaded
# [{"name": "codellama", "memory": "4GB"}, {"name": "llama3", "memory": "8GB"}]

# Chat using llama3 (auto-selected or specified)
curl http://localhost:11434/api/chat -d '{"model": "llama3", "messages": [...]}'

# Unload codellama when you need VRAM for other tasks
curl -X POST http://localhost:11434/api/model/unload -d '{"model": "codellama"}'

This feature would significantly improve developer experience and make Ollama more suitable for production multi-model deployments.


@rick-github commented on GitHub (Mar 7, 2026):

Ollama supports multiple loaded models.
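
For anyone landing here, the behaviour referred to above can be exercised with documented Ollama mechanisms: the keep_alive request field, the /api/ps endpoint, and the OLLAMA_MAX_LOADED_MODELS environment variable. A minimal illustration against a default local install:

# Allow two models to stay resident at once
OLLAMA_MAX_LOADED_MODELS=2 ollama serve

# Pre-load models with empty generate requests; keep_alive -1 keeps them loaded indefinitely
curl http://localhost:11434/api/generate -d '{"model": "codellama", "keep_alive": -1}'
curl http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": -1}'

# List currently loaded models
curl http://localhost:11434/api/ps

# Unload a model immediately by setting keep_alive to 0
curl http://localhost:11434/api/generate -d '{"model": "codellama", "keep_alive": 0}'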

Reference: github-starred/ollama#56016