[PR #4218] [MERGED] Enable concurrency by default #21958

Closed
opened 2026-04-19 15:58:54 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/4218
Author: @dhiltgen
Created: 5/7/2024
Status: Merged
Merged: 7/1/2024
Merged by: @dhiltgen

Base: main ← Head: auto_parallel


📝 Commits (3)

  • 17b7186 Enable concurrency by default
  • 9929751 Disable concurrency for AMD + Windows
  • 642cee1 Sort the ps output

📊 Changes

7 files changed (+175 additions, -73 deletions)

📝 envconfig/config.go (+8 -8)
📝 gpu/amd_windows.go (+3 -2)
📝 gpu/types.go (+5 -0)
📝 llm/server.go (+3 -10)
📝 server/routes.go (+5 -0)
📝 server/sched.go (+100 -24)
📝 server/sched_test.go (+51 -29)

📄 Description

This adjusts our default settings to enable loading multiple models and serving parallel requests to a single model. Users can still override these defaults with the same env var settings as before. Parallelism has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.

Corresponding Doc update to merge after this #5364


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 15:58:55 -05:00

Reference: github-starred/ollama#21958