[GH-ISSUE #6651] The speed of using embedding models is much slower compared to xinference #4187

Open
opened 2026-04-12 15:07:08 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @yushengliao on GitHub (Sep 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6651

I use the BGE-M3 model and send the same embedding request to both backends: xinference takes about 10 seconds, while ollama takes about 200 seconds.
I'm sure both use the GPU.
I noticed that xinference allocates more video memory, while ollama's video memory usage stays essentially unchanged. Perhaps this explains the speed difference?
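To make the comparison reproducible, a minimal timing sketch along these lines could measure a single embedding request against a local ollama server. This is not from the original report: the endpoint path `/api/embeddings`, the default port `11434`, and the model tag `bge-m3` are all assumptions and may differ on your setup.

```python
# Hypothetical timing sketch (not from the original report): time one
# embedding request to a local ollama server. Endpoint path, port, and
# model tag are assumptions.
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # assumed default

def build_payload(text: str, model: str = "bge-m3") -> bytes:
    """Encode the JSON body for one embedding request."""
    return json.dumps({"model": model, "prompt": text}).encode("utf-8")

def timed_embed(text: str) -> tuple[int, float]:
    """Send one request; return (embedding dimension, elapsed seconds)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return len(body["embedding"]), time.perf_counter() - start

# Usage (requires a running ollama server):
#   dim, elapsed = timed_embed("hello world")
#   print(dim, f"{elapsed:.2f}s")
```

Running the same text through both backends with a comparable script would isolate whether the 10s-vs-200s gap is per-request latency or something else (model load, batching, CPU fallback).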

GiteaMirror added the feature request, performance labels 2026-04-12 15:07:08 -05:00

Reference: github-starred/ollama#4187