The rerank model cannot run on the GPU, causing it to be very slow. #2501

Closed
opened 2025-11-11 15:08:41 -06:00 by GiteaMirror · 0 comments

Originally created by @zero456 on GitHub (Oct 29, 2024).

Computer:
2X Intel(R) Xeon(R) Gold 6242R CPU
64.0 GB RAM
NVIDIA Quadro RTX 6000

Open-WebUI settings:
Engine: Ollama
Embedding Batch Size = 12
Hybrid Search: Enabled
Embed Model: bge-m3:latest
Rerank Model: baai/bge-reranker-v2-m3 (downloaded from Hugging Face)

After running a query, we observed on the backend that the embedding step completes very quickly, with only a brief spike in CUDA GPU utilization. CPU utilization then rises to 60%~100% and stays high for a prolonged period until the answer is generated.
Based on these observations, we suspect that the rerank model is running on the CPU. Is it possible to run it on the GPU instead to improve speed?
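
To help narrow this down, here is a minimal sketch of how the suspicion could be checked outside Open-WebUI. It assumes the reranker is a cross-encoder loaded through the `sentence-transformers` package on top of PyTorch, which this report does not confirm. The helper names `pick_device` and `load_reranker` are hypothetical and are not part of Open-WebUI.

```python
from typing import Optional


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency; treated as "no GPU" when missing
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_reranker(model_name: str = "BAAI/bge-reranker-v2-m3",
                  device: Optional[str] = None):
    """Load the reranker on an explicit device.

    Assumption: the model loads as a sentence-transformers CrossEncoder.
    The import is deferred so pick_device() works without the package.
    """
    from sentence_transformers import CrossEncoder

    return CrossEncoder(model_name, device=device or pick_device())


# Example use (downloads the model on first call, so it is left commented out):
# reranker = load_reranker()
# scores = reranker.predict([["what is a GPU?", "A GPU is a graphics processor."]])
```

If `pick_device()` returns "cpu" on this machine even though the Quadro RTX 6000 is installed, the Python environment running the backend probably has a CPU-only PyTorch build or cannot see the CUDA driver, which would be consistent with the high CPU load described above.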

Reference: github-starred/open-webui#2501