mixtral:8x7b-instruct-v0.1-fp16 served on Ollama performs worse than the same model served on vllm with same configuration #2063

Closed
opened 2025-11-12 10:43:55 -06:00 by GiteaMirror · 6 comments

Originally created by @yilei-ding on GitHub (Mar 25, 2024).

Hi, I've been comparing the inference speeds of serving the **unquantized** `mixtral:8x7b-instruct-v0.1-fp16` on Ollama and on vLLM. I set the temperature to 0 and the same number of generated tokens on both, yet the Mixtral model served on Ollama performs very badly. I also checked that `[INST]` and `[/INST]` were added to the prompt on Ollama, the same as on vLLM, but the model still performs very badly. Notably, Ollama manages to run the model on just 2 A6000 GPUs (48 GB of memory each), whereas both vLLM and Hugging Face require 4 GPUs to handle the **unquantized** Mixtral 8x7B model. This has led me to wonder: does Ollama apply any form of on-the-fly quantization?
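As a rough check on the memory side of this question, the sketch below estimates whether the fp16 weights alone could fit on two 48 GB cards without any quantization. It is not taken from the thread: the parameter count (about 46.7 billion in total for Mixtral 8x7B) is an assumption based on the figure Mistral AI published, the helper name `headroom_bytes` is made up for illustration, and KV cache and runtime buffers are deliberately left out.

```python
# Back-of-the-envelope estimate of fp16 weight size versus GPU capacity.
# Assumption: ~46.7e9 total parameters for Mixtral 8x7B (not stated in the thread).
# KV cache, activations and runtime buffers are NOT included.

PARAMS = 46.7e9          # assumed total parameter count
BYTES_PER_PARAM = 2      # fp16
weights_bytes = PARAMS * BYTES_PER_PARAM

GB = 1e9                 # decimal gigabyte
GIB = 2**30              # binary gibibyte


def headroom_bytes(n_gpus: int, per_gpu_bytes: float) -> float:
    """Memory left after the weights if they are spread over n_gpus cards."""
    return n_gpus * per_gpu_bytes - weights_bytes


print(f"fp16 weights: {weights_bytes / GB:.1f} GB")

# "48G" is ambiguous, so show both readings of the card size.
for label, unit in (("GB", GB), ("GiB", GIB)):
    spare = headroom_bytes(2, 48 * unit)
    print(f"2 x 48 {label}: {spare / GB:.1f} GB left for everything else")
```

Under either reading the weights fit on two cards with only a few gigabytes to spare, so running on 2 GPUs does not by itself imply quantization. It does leave little room for the KV cache and buffers, which is one way part of a model could end up outside GPU memory.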

GiteaMirror added the bug, gpu, nvidia labels 2025-11-12 10:43:55 -06:00

@igorschlum commented on GitHub (Mar 25, 2024):

Hi @yilei-ding, which OS are you running Ollama on, and how much RAM does the machine have? Can you please share a prompt, or a script that runs several prompts, so we can replicate the issue?
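A reproduction script of the kind requested here could start from matching request bodies for the two servers. The sketch below is only an illustration, not the reporter's setup: it assumes Ollama's `/api/generate` endpoint and vLLM's OpenAI-compatible `/v1/completions` endpoint, the vLLM model name is a guess at the Hugging Face ID, and the exact `[INST]` spacing is an assumption (the template details matter, as a later comment in this thread shows). Ollama applies the model's template to `prompt` itself, while a plain completions request does not, so the tags are written by hand on the vLLM side.

```python
import json

# Hypothetical model identifiers; adjust to whatever is actually deployed.
OLLAMA_MODEL = "mixtral:8x7b-instruct-v0.1-fp16"
VLLM_MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed Hugging Face ID


def ollama_payload(model: str, prompt: str, max_tokens: int) -> dict:
    """Body for POST /api/generate; sampling settings live under 'options'."""
    return {
        "model": model,
        "prompt": prompt,          # Ollama wraps this in the model template
        "stream": False,
        "options": {"temperature": 0, "num_predict": max_tokens},
    }


def vllm_payload(model: str, prompt: str, max_tokens: int) -> dict:
    """Body for POST /v1/completions; the instruct tags are added manually."""
    return {
        "model": model,
        "prompt": f"[INST] {prompt} [/INST]",  # assumed spacing
        "temperature": 0,
        "max_tokens": max_tokens,
    }


question = "Explain what a mixture-of-experts model is."
print(json.dumps(ollama_payload(OLLAMA_MODEL, question, 256), indent=2))
print(json.dumps(vllm_payload(VLLM_MODEL, question, 256), indent=2))
```

Posting each body to its server and comparing the returned text and timings keeps the temperature and token budget identical on both sides, which are the two settings the report says were held equal.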

@igorschlum commented on GitHub (Apr 12, 2024):

Hi @yilei-ding, did you try with version 0.1.31? Could you please share your RAM, CPU, OS, and a script so we can try to reproduce the issue? With no further news and no other users reporting the same problem, the issue could be closed.

@flefevre commented on GitHub (Apr 19, 2024):

Could you share your vLLM configuration and command line?

@pdevine commented on GitHub (May 16, 2024):

@yilei-ding the template for `mixtral:8x7b-instruct-v0.1-fp16` was _slightly_ off (there was an additional space at the beginning of the template), which may have been causing poor results. I've just pushed an update to the template, so you may want to try again.

Ollama doesn't do any quantization on the fly, but there was a change a month or so ago to the convert scripts which changed how the MoEs get converted (specifically, it lumped the experts together in a different way with the up/down/gate attention layers). I'll try that out and see if there is a performance difference.

@pdevine commented on GitHub (May 17, 2024):

OK, I have re-converted the fp16 version and I get comparable performance for both.

On the new version I get:

total duration:       1m28.047026667s
load duration:        2.070959ms
prompt eval count:    13 token(s)
prompt eval duration: 3.371297s
prompt eval rate:     3.86 tokens/s
eval count:           1132 token(s)
eval duration:        1m24.670792s
eval rate:            13.37 tokens/s

On `mixtral:8x7b-instruct-v0.1-fp16` I get:

total duration:       1m20.200884042s
load duration:        4.080167ms
prompt eval count:    13 token(s)
prompt eval duration: 3.398857s
prompt eval rate:     3.82 tokens/s
eval count:           1031 token(s)
eval duration:        1m16.795729s
eval rate:            13.43 tokens/s

So there is effectively no difference between the two conversions. What I *think* may be happening is that something is getting offloaded to the CPU. Can you update your ollama version and try the new `ollama ps` command while the model is loaded? It should say `100% GPU` if the model was loaded correctly onto the GPUs.
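The rates in the two listings above can be re-derived from the counts and durations, which makes the "no difference" conclusion easy to verify. A minimal sketch, assuming the durations use Go-style formatting such as `1m24.670792s`; the helper names are made up for illustration.

```python
import re

# Seconds per unit for Go-style duration strings like "1m24.670792s".
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# Multi-character units come first so "ms" is not read as "m" followed by "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Convert a duration such as '1m24.670792s' or '2.070959ms' to seconds."""
    parts = _PART.findall(text)
    if not parts:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def tokens_per_second(count: int, duration: str) -> float:
    return count / parse_duration(duration)


# Figures copied from the two listings in this comment.
new_rate = tokens_per_second(1132, "1m24.670792s")
old_rate = tokens_per_second(1031, "1m16.795729s")

print(f"re-converted model: {new_rate:.2f} tokens/s")
print(f"published fp16 tag: {old_rate:.2f} tokens/s")
print(f"relative difference: {abs(new_rate - old_rate) / old_rate:.2%}")
```

The two eval rates differ by well under one percent, so the change to the convert scripts does not explain a large slowdown.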

@dhiltgen commented on GitHub (Jul 25, 2024):

@yilei-ding if you're still seeing performance problems, please share more information about your setup and I'll reopen the issue. Share the `ollama ps` output so we can rule out partial offload as the cause of the performance problem.
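The same check can be scripted against the server's list-running-models endpoint instead of reading the `ollama ps` table by eye. The sketch below assumes `GET /api/ps` on the default local address and relies on the `size` and `size_vram` fields of each entry; the helper names and the sample entry are made up for illustration and are not real output.

```python
import json
import urllib.request


def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch the currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as response:
        return json.load(response).get("models", [])


def gpu_fraction(entry: dict) -> float:
    """Share of the loaded model that sits in GPU memory (0.0 to 1.0)."""
    size = entry.get("size", 0)
    return entry.get("size_vram", 0) / size if size else 0.0


def is_partially_offloaded(entry: dict) -> bool:
    """True when some of the model is held outside GPU memory."""
    return gpu_fraction(entry) < 1.0


# Made-up entry for illustration only; real values come from loaded_models().
sample = {"name": "mixtral:8x7b-instruct-v0.1-fp16",
          "size": 100_000_000_000,
          "size_vram": 80_000_000_000}

print(f"{sample['name']}: {gpu_fraction(sample):.0%} on GPU, "
      f"partial offload: {is_partially_offloaded(sample)}")
```

Calling `loaded_models()` while the model is loaded and passing each entry to `gpu_fraction` gives the same information as the GPU percentage reported by `ollama ps`.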

Reference: github-starred/ollama-ollama#2063