[PR #3418] [MERGED] Request and model concurrency #21688

Closed
opened 2026-04-19 15:47:14 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/3418
Author: @dhiltgen
Created: 3/30/2024
Status: Merged
Merged: 4/23/2024
Merged by: @dhiltgen

Base: main ← Head: concurrency


📝 Commits (2)

  • 34b9db5 Request and model concurrency
  • f2ea847 Local unicode test case

📊 Changes

30 files changed (+2591 additions, -1363 deletions)


📝 api/client.go (+7 -0)
📝 format/bytes.go (+1 -0)
📝 gpu/amd_common.go (+56 -14)
📝 gpu/amd_hip_windows.go (+2 -2)
📝 gpu/amd_linux.go (+174 -285)
📝 gpu/amd_windows.go (+80 -74)
📝 gpu/assets.go (+2 -2)
➕ gpu/cuda_common.go (+22 -0)
📝 gpu/gpu.go (+82 -145)
📝 gpu/gpu_darwin.go (+23 -34)
📝 gpu/gpu_info.h (+8 -4)
📝 gpu/gpu_info_cpu.c (+6 -2)
📝 gpu/gpu_info_cudart.c (+63 -82)
📝 gpu/gpu_info_cudart.h (+96 -10)
➖ gpu/gpu_info_nvml.c (+0 -221)
➖ gpu/gpu_info_nvml.h (+0 -57)
📝 gpu/gpu_test.go (+6 -13)
📝 gpu/types.go (+43 -6)
📝 integration/basic_test.go (+44 -2)
➕ integration/concurrency_test.go (+225 -0)

...and 10 more files

📄 Description

This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The change is designed to be "opt in" initially: the default behavior mimics the current sequential implementation (one request at a time, and only a single model), but this can be changed by setting environment variables for the server. In the future we will change the defaults to enable concurrency.

By default, this change supports one concurrent request per loaded model, which can be adjusted via OLLAMA_NUM_PARALLEL.

By default, this change supports one loaded model at a time, which can be adjusted via OLLAMA_MAX_LOADED_MODELS. Set it to zero for fully dynamic behavior based on VRAM capacity, or to a fixed number greater than 1 to cap the total number of loaded models regardless of VRAM capacity. In the >1 scenario, we'll still perform VRAM prediction calculations to check that the new model will completely fit into available VRAM, but we'll refuse to load more than the specified number of runners even if they would have fit.

This change also adjusts our GPU selection algorithm in multi-GPU scenarios. If we can fit the model into a single GPU, we'll favor that over spreading it across multiple GPUs.

Note: system memory capacity is not taken into consideration in this change, so for CPU mode, setting max runners to zero could result in memory exhaustion and paging/swapping.

Fixes #2109
Fixes #1656
Fixes #1514
Fixes #3304
Fixes #961
Fixes #3507


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 15:47:14 -05:00

Reference: github-starred/ollama#21688