The rerank model cannot run on the GPU, causing it to be very slow. #2501


2025-11-11T15:08:41-06:00

GiteaMirror commented

2025-11-11 15:08:41 -06:00

Originally created by @zero456 on GitHub (Oct 29, 2024).

Computer:
2X Intel(R) Xeon(R) Gold 6242R CPU
64.0 GB RAM
NVIDIA Quadro RTX 6000

Open-WebUI settings:
Engine: Ollama
Embedding Batch Size = 12
Hybrid Search: Enabled
Embed Model: bge-m3:latest
Rerank Model: baai/bge-reranker-v2-m3 (downloaded from Hugging Face)

After running a query, we watched the backend and saw that the embedding step completes very quickly, with a brief spike in CUDA GPU utilization. CPU utilization then rises to 60-100% and stays high for a prolonged period until the answer is generated.
Based on these observations, we suspect that the rerank model is running on the CPU. Can it be changed to run on the GPU to improve speed?
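
For anyone trying to confirm the suspicion above, here is a minimal diagnostic sketch. It rests on assumptions not stated in this report: that the reranker is loaded through the `sentence-transformers` `CrossEncoder` class, and that PyTorch is the runtime. The helper names `pick_device` and `load_reranker` are hypothetical and are not part of Open-WebUI.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a CUDA GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so nothing here can run on the GPU.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


def load_reranker(model_name: str = "BAAI/bge-reranker-v2-m3"):
    """Load the reranker on the detected device.

    Assumption: the model is loaded via sentence-transformers' CrossEncoder,
    which accepts a `device` argument. This downloads the model on first use.
    """
    from sentence_transformers import CrossEncoder

    return CrossEncoder(model_name, device=pick_device())


if __name__ == "__main__":
    # Run this inside the same environment or container as the backend.
    print("reranker would run on:", pick_device())
```

If this prints `cpu` inside the backend's environment even though the host has an NVIDIA GPU, the likely cause is a CPU-only PyTorch build or a container started without GPU access, rather than the rerank model itself.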


GiteaMirror closed this issue

2025-11-11 15:08:41 -06:00

GiteaMirror referenced this issue

2025-11-11 17:36:45 -06:00

[PR #2501] [MERGED] refac: speed up app mount by parallelizing API requests #7816

GiteaMirror referenced this issue

2026-04-20 03:18:14 -05:00

[PR #2501] [MERGED] refac: speed up app mount by parallelizing API requests #21020

GiteaMirror referenced this issue

2026-04-25 10:25:30 -05:00

[PR #2501] [MERGED] refac: speed up app mount by parallelizing API requests #36650

GiteaMirror referenced this issue