[PR #9634] [CLOSED] server: allow running embed models in parallel #18290

Closed
opened 2026-04-16 06:30:59 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9634
Author: @BruceMacD
Created: 3/10/2025
Status: Closed

Base: main ← Head: brucemacd/parallel-embed-models


📝 Commits (1)

  • 12a8b00 server: allow running embed models in parallel

📊 Changes

1 file changed (+0 additions, -5 deletions)

View changed files

📝 server/sched.go (+0 -5)

📄 Description

The ability to run embedding models in parallel with other types of models was removed because of limitations in the server's slot-loading system in an earlier version. That slot-loading system is no longer used, so embedding models can once again run in parallel with chat models.

Without embedding and chat models running in parallel, retrieval-augmented generation can be much slower: the chat model must be unloaded from memory before the embedding model can run, and then reloaded when the next chat request arrives.
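The scheduling change can be illustrated with a minimal sketch. The types, names, and flag below are hypothetical simplifications for illustration, not Ollama's actual `server/sched.go` code: the idea is that a guard which previously forced eviction when model kinds differed is removed, so an embedding runner and a chat runner may stay resident at the same time.

```go
package main

import "fmt"

// Model is a hypothetical, simplified stand-in for a loaded runner.
type Model struct {
	Name      string
	Embedding bool // true for embedding-only models
}

// canCoexist reports whether a pending request for `next` can run
// alongside the already-loaded `loaded` model. The removed guard is
// sketched here as a flag: the old behavior evicted the loaded model
// whenever the model kinds differed; the new behavior keeps both.
func canCoexist(loaded, next Model, allowParallelEmbed bool) bool {
	if loaded.Embedding == next.Embedding {
		return true // same kind could already share the scheduler
	}
	return allowParallelEmbed
}

func main() {
	chat := Model{Name: "llama3", Embedding: false}
	embed := Model{Name: "nomic-embed-text", Embedding: true}

	// Old behavior: the chat model would be evicted for the embed request.
	fmt.Println(canCoexist(chat, embed, false))
	// New behavior: both runners stay loaded.
	fmt.Println(canCoexist(chat, embed, true))
}
```

In the RAG case this matters because each retrieval step issues an embedding request between chat requests, so the old behavior triggered a full unload/reload cycle of the chat model on every turn.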

Original PR for this bug, for reference: https://github.com/ollama/ollama/pull/6467


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 06:30:59 -05:00

Reference: github-starred/ollama#18290