[GH-ISSUE #1556] Delays and slowness when using mixtral #47362

Closed
opened 2026-04-28 03:37:13 -05:00 by GiteaMirror · 41 comments

Originally created by @djmaze on GitHub (Dec 15, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1556

It seems that as the context grows, the delay until the first output gets longer and longer, taking more than half a minute after a few prompts. Also, text generation seems much slower than with the latest llama.cpp (command line).

Using CUDA on an RTX 3090. I tried `mixtral:8x7b-instruct-v0.1-q4_K_M` (with CPU offloading) as well as `mixtral:8x7b-instruct-v0.1-q2_K` (completely in VRAM).

For comparison, I tried `starling-lm:7b-alpha-q4_K_M`, which does not seem to exhibit any of these problems.

Sorry for the imprecise report; I'm running out of time right now. Does anyone have a similar experience with Mixtral? Or is this expected behaviour with ollama? (First-time user here.)
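
(For reference, not part of the original report: the per-request timings that `ollama run <model> --verbose` prints are also returned by the HTTP API, which makes slowdowns like this easy to quantify. A minimal sketch, assuming a local server on the default port and a model that is already pulled; the duration fields come back in nanoseconds.)

```python
# Sketch: read ollama's server-reported timings for one request.
# Assumes ollama is listening on localhost:11434 and the model has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q4_K_M",
        "prompt": "Summarize the plot of Hamlet in three sentences.",
        "stream": False,
    },
    timeout=600,
).json()

ns = 1e9  # durations are returned in nanoseconds
prompt_tokens = resp.get("prompt_eval_count", 0)
prompt_secs = resp.get("prompt_eval_duration", 0) / ns
gen_tokens = resp["eval_count"]
gen_secs = resp["eval_duration"] / ns

print(f"prompt eval: {prompt_tokens} tokens in {prompt_secs:.2f}s")
print(f"generation:  {gen_tokens} tokens in {gen_secs:.2f}s ({gen_tokens / gen_secs:.1f} tokens/s)")
```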

Originally created by @djmaze on GitHub (Dec 15, 2023). Original GitHub issue: https://github.com/ollama/ollama/issues/1556 It seems as the context grows, the delay until the first output is getting longer and longer, taking more than half a minute after a few prompts. Also, text generation seems much slower than with the latest llama.cpp (commandline). Using CUDA on a RTX 3090. Tried out `mixtral:8x7b-instruct-v0.1-q4_K_M` (with CPU offloading) as well as `mixtral:8x7b-instruct-v0.1-q2_K` (completely in VRAM). As a comparison, I tried `starling-lm:7b-alpha-q4_K_M`, which seems not to exhibit any of these problems. Sorry for the unprecise report, running out of time right now. Does anyone have a similar experience with Mixtral? Or is this expected behaviour with ollama? (First-time user here.)
Author
Owner

@madsamjp commented on GitHub (Dec 16, 2023):

Can confirm I'm also having this issue. I'm running `dolphin-mixtral:8x7b-v2.5-q5_K_M` with 22 layers offloaded to the GPU (RTX 4090). The first response takes 2 seconds, the second 26, the third 37, and the fourth 49. By the fourth response there are 888 tokens in the context window.

Eval rate is a respectable ~10 tokens/s, but with a prompt eval of over a minute by the fifth response, it's unusable.

@easp commented on GitHub (Dec 17, 2023):

Yeah, big issue on Apple Silicon Macs, too. I've seen references to this being a known problem for Mixtral on llama.cpp right now, but I can't find an actual issue about it on the llama.cpp GitHub.

@phalexo commented on GitHub (Dec 17, 2023):

Ollama has a history file in the ~/.ollama folder. Does ollama constantly parse that cache?

@easp commented on GitHub (Dec 17, 2023):

That's just the readline history: the commands entered in the REPL.

@easp commented on GitHub (Dec 17, 2023):

Looks like this recently merged llama.cpp PR may improve prompt-processing speed with Mixtral: https://github.com/ggerganov/llama.cpp/pull/4480

@coder543 commented on GitHub (Dec 17, 2023):

The default `mixtral` Modelfile only offloads like 22 layers, as noted previously. For people with 24GB of VRAM, I have found that the `q3_K_S` model can be completely offloaded to the GPU, which speeds things up dramatically:

Make a `Modelfile`:

```
FROM mixtral:8x7b-instruct-v0.1-q3_K_S
PARAMETER num_gpu 33
```

Then run `ollama create mixtral_gpu -f ./Modelfile`

Then you can run `ollama run mixtral_gpu` and see how it does.
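
(As an aside, not from the original comment: the same `num_gpu` override can, as far as I know, also be passed per request through the API's `options` field instead of baking it into a new model. A minimal sketch, assuming a local server on the default port:)

```python
# Sketch: per-request num_gpu override instead of a custom Modelfile.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q3_K_S",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"num_gpu": 33},  # ask for all 33 layers on the GPU
    },
    timeout=600,
)
print(resp.json()["response"])
```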

@coder543 commented on GitHub (Dec 17, 2023):

I also wonder if it would be possible for ollama to keep the eval state between prompts, rather than re-processing the entire context window for each new message. I understand ollama is trying to run a model server so there could be requests coming from more than one session at a time, but maybe it's possible to only clear the state and start from scratch if a request from a different session is received? This is all a little beyond my expertise, so I could be completely wrong.
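
(Illustration only, not ollama's actual code: the idea being suggested here is prompt-prefix caching. A toy sketch of the bookkeeping, with hypothetical names:)

```python
# Toy sketch of prompt-prefix caching (hypothetical, not ollama's implementation).
# Keep the tokens whose attention (KV) state is already in memory; on the next
# request, only the unseen suffix needs to be evaluated.

class PrefixCache:
    def __init__(self) -> None:
        self.cached_tokens: list[int] = []

    def tokens_to_evaluate(self, prompt_tokens: list[int]) -> list[int]:
        # Length of the longest prefix shared with the cached conversation.
        common = 0
        for old, new in zip(self.cached_tokens, prompt_tokens):
            if old != new:
                break
            common += 1
        self.cached_tokens = list(prompt_tokens)  # remember the new full context
        return prompt_tokens[common:]             # only this suffix needs eval


cache = PrefixCache()
print(len(cache.tokens_to_evaluate([1, 2, 3, 4])))        # 4: first prompt, full eval
print(len(cache.tokens_to_evaluate([1, 2, 3, 4, 5, 6])))  # 2: follow-up, suffix only
```

In a real server the cached state is the KV cache for those tokens, and a request from a different session (or an edited earlier message) invalidates everything past the shared prefix.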

@phalexo commented on GitHub (Dec 17, 2023):

> The default `mixtral` Modelfile only offloads like 22 layers, as noted previously. For people with 24GB of VRAM, I have found that the `q3_K_S` model can be completely offloaded to the GPU, which speeds things up dramatically:
>
> Make a `Modelfile`:
>
> ```
> FROM mixtral:8x7b-instruct-v0.1-q3_K_S
> PARAMETER num_gpu 33
> ```
>
> Then run `ollama create mixtral_gpu -f ./Modelfile`
>
> Then you can run `ollama run mixtral_gpu` and see how it does.

Using llama.cpp directly in interactive mode does not appear to have any major delays. It takes merely a second or two to start answering even after a relatively long conversation.

Looks like latency is specific to ollama.

@djmaze commented on GitHub (Dec 17, 2023):

@coder543 As stated in my initial post, I even tried the q2_k version, loading all 33 layers into the GPU. Still, the token generation is quite slow and the delay before the token generation starts increases on every prompt as the context grows.

As also stated, when using llama.cpp or a totally different model, there are no delays and the token generation (for the same model) is significantly faster.

@coder543 commented on GitHub (Dec 17, 2023):

@djmaze that is strange, since I'm not encountering any unusual problems on my 3090.

```
total duration:       18.273218027s
prompt eval count:    1180 token(s)
prompt eval duration: 15.833678s
prompt eval rate:     74.52 tokens/s
eval count:           114 token(s)
eval duration:        2.391734s
eval rate:            47.66 tokens/s
```

Here, there are nearly 1200 tokens in the context window of previous chat messages, and yet it is able to generate a response in less than 20 seconds. Yes, this is slower than it could be, but that seems to relate to what I mentioned in my previous comment about it not keeping the eval state between generations.

This is not the terrible performance that other people are describing, where it is taking 50 seconds with less than 900 tokens in the context window.

EDIT: testing mistral (instead of mixtral), I am seeing this after a similar situation:

```
total duration:       2.244759039s
prompt eval count:    1211 token(s)
prompt eval duration: 421.415ms
prompt eval rate:     2873.65 tokens/s
eval count:           208 token(s)
eval duration:        1.774238s
eval rate:            117.23 tokens/s
```

The key differentiator is that the prompt eval rate is obviously way higher. As someone else linked to a PR which improved prompt eval rate on the CPU, it isn't crazy to assume that the prompt eval rate on the GPU needs some improvements as well. You say llama.cpp is much faster at this, but I haven't actually observed any real difference. Doing more testing now.

EDIT 2: yes, using llama.cpp server, it appears to be doing exactly what I mentioned: keeping the eval state in memory. It is processing prompt tokens at the same rate as ollama, it is just processing fewer of them because it does not appear to be re-evaluating the entire context window with each new prompt. The other ollama models suffer the same problems, they just seem to have a much higher prompt eval rate than mixtral, which helps to mask it.

@kaykyr commented on GitHub (Dec 17, 2023):

I can confirm the same issue here, even using both a 3090 and a 4090.

@djmaze commented on GitHub (Dec 17, 2023):

I just tried out nous-hermes:70b-llama2-q2_K in order to have a bigger model for comparison. With 51 of 81 layers offloaded to GPU, the token generation is quite slow, as expected. But I do not experience the initial delay, even when the context grows.

I also tried dolphin-mixtral:8x7b-v2.5-q4_K_M (a Mixtral finetune). It causes the same delays as I've seen with mixtral:8x7b-instruct.

From this I deduce that (at least for me) the problem is specific to the Mixtral models.

@coder543 commented on GitHub (Dec 17, 2023):

@djmaze please post the verbose output. Does it not show that the number of prompt eval tokens is growing? Presumably, it just has a much more optimized prompt eval rate, as with the mistral output I showed, but it should still have the same fundamental issue that it does not cache the eval state.

@djmaze commented on GitHub (Dec 18, 2023):

@coder543 (Sorry, I was testing with the web UI before, so I didn't have any values.) After I found out how to do it, I tested the prompt eval rate of several models with ollama now (approximate values):

| Model | Offloaded layers | Eval rate (tokens/s) | Prompt eval rate (tokens/s) |
|-------|------------------|----------------------|-----------------------------|
| starling-lm:7b-alpha-q4_K_M | 33/33 | 105 | 2700 |
| mistral:7b-instruct-v0.2-q5_K_M | 33/33 | 98 | 2200 |
| nous-hermes:70b-llama2-q2_K | 51/81 | 3 | 140 |
| mixtral:8x7b-instruct-v0.1-q2_K | 33/33 | 61 | 100 |
| mixtral:8x7b-instruct-v0.1-q4_K_M | 22/33 | 13 | 26 |

It seems interesting to me that although nous-hermes:70b-llama2-q2_K has a similar number of layers offloaded to the GPU and a much slower eval rate, it still shows a much higher prompt eval rate than mixtral:8x7b-instruct-v0.1-q4_K_M.

TL;DR: You seem to be right. The mixtral prompt eval rate, at least when only partially offloaded, looks abysmal. I wonder if that is because of the MoE architecture. Or does it also depend on the quantization?
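
(For anyone who wants to reproduce numbers like these outside the REPL, a rough sketch against the HTTP API; the model list and prompt are placeholders, and the duration fields are reported in nanoseconds.)

```python
# Sketch: compare prompt-eval and generation rates across models via the API.
import requests

MODELS = [
    "starling-lm:7b-alpha-q4_K_M",
    "mixtral:8x7b-instruct-v0.1-q2_K",
]
PROMPT = "Explain the difference between TCP and UDP in detail. " * 40  # longish prompt

for model in MODELS:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=1200,
    ).json()
    prompt_rate = r.get("prompt_eval_count", 0) / (r.get("prompt_eval_duration", 1) / 1e9)
    eval_rate = r["eval_count"] / (r["eval_duration"] / 1e9)
    print(f"{model:40s} prompt eval {prompt_rate:8.1f} tok/s   eval {eval_rate:6.1f} tok/s")
```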

@ghost commented on GitHub (Dec 19, 2023):

I installed Ollama with the `curl ... | sh` command on WSL, and I'm running `dolphin-mixtral:latest` with 64 GB of RAM and a 4080 with 16 GB of VRAM. I don't really understand anything about running this stuff, but yeah, the more I talk to the AI, the longer every reply gets delayed. Is it something that can be fixed at the software level, or something I can do on my end?

@djmaze commented on GitHub (Dec 19, 2023):

Either way, I support @coder543's wish for a prompt eval cache. There is already an issue at #1573 for that, maybe we can continue there.

@jamesbascle commented on GitHub (Dec 19, 2023):

I used

```
FROM dolphin-mixtral
PARAMETER num_gpu 33
```

to try getting as much of it onto my 3090 as possible and got a bit of a speedup but it is still pretty slow and only gets slower as the conversation goes on.

@phalexo commented on GitHub (Dec 20, 2023):

I was wondering if there is any indication that someone is looking into this? Also, I am wondering what effect the `LLAMA_CUDA_FORCE_MMQ=on` setting has on performance. If the optimized cuBLAS kernels are not used, then what is the performance penalty when using the MMQ kernels instead?

And why was ollama 0.1.11 and earlier working? Presumably it was using cuBLAS. What changed from 0.1.11 to 0.1.12 to make it stop working?

@gnusenpai commented on GitHub (Dec 21, 2023):

Building ollama with https://github.com/ggerganov/llama.cpp/pull/4538 and (optionally, if you do CPU+GPU inference) https://github.com/ggerganov/llama.cpp/pull/4553 has made prompt eval significantly faster for me. (~60t/s vs. ~10t/s)

@coder543 commented on GitHub (Dec 21, 2023):

For me, using llama.cpp directly, that PR appears to have raised prompt eval rate to about 325t/s:

```
print_timings: prompt eval time = 3444.74 ms / 1122 tokens ( 3.07 ms per token, 325.71 tokens per second)
print_timings: eval time = 5166.55 ms / 205 runs ( 25.20 ms per token, 39.68 tokens per second)
print_timings: total time = 8611.28 ms
```

Still not as fast as other models, but a significant improvement.

@djmaze commented on GitHub (Dec 21, 2023):

But beware, it seems the quality might have dropped as a side effect: https://github.com/ggerganov/llama.cpp/issues/4572

@Confuze commented on GitHub (Dec 22, 2023):

I think I'm running into the same issue on `v0.0.17` (installed from the `ollama-cuda` package on Arch Linux).

When running dolphin-mixtral with `num_gpu` set to 10000 (just to be sure), it's practically unusable: it takes the model about a minute to start responding to a single prompt, and it generates the answer painfully slowly. (I tried it without the `num_gpu` parameter as well; no difference.) According to `nvidia-smi`, ollama isn't using the GPU (RTX 2070) at all.

This appears to be a problem related only to mixtral, as running others like llama2 results in perfect performance with my GPU being fully used.

By reading through this issue I understand there's not much we can do on a user's level, right? Apologies if this comment makes no sense, I know nothing about this thing just wanted to generate the recipe for meth.

@coder543 commented on GitHub (Dec 22, 2023):

@Confuze you don’t have enough VRAM to run Mixtral entirely on the GPU. ollama will try to load the model onto the GPU, run out of memory, and then fall back to just running on the CPU.

@phalexo commented on GitHub (Dec 22, 2023):

I don't think we should confuse two separate problems.

Sometimes there is really not enough VRAM.

Sometimes you run into the cuBLAS 15 error, which was introduced starting with v0.1.12 and which also often looks like an OOM.

v0.1.11 didn't have this issue.

The only way to mitigate it that I am aware of at the moment is to build with `LLAMA_CUDA_FORCE_MMQ=on`, but this solution, as far as I know, is slower than cuBLAS.

It really should be fixed.

@Confuze commented on GitHub (Dec 22, 2023):

> @Confuze you don’t have enough VRAM to run Mixtral entirely on the GPU. ollama will try to load the model onto the GPU, run out of memory, and then fall back to just running on the CPU.

I see; after looking at the logs, it seems like you are right.

```
CUDA error 2 at /build/ollama-cuda/src/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:9148: out of memory
```

So, the only option I have is running this model on a CPU (besides getting a better GPU, of course)? There's no way to load it partially on the GPU and partially on the CPU?

@coder543 commented on GitHub (Dec 22, 2023):

@Confuze the `num_gpu` parameter that you set to 10000 was trying to force _more_ layers onto the GPU. Mixtral has 33 layers. You just have to keep lowering that number until the VRAM usage is low enough. I would be surprised if you can fit more than 10 layers on the 8GB of VRAM that I think your GPU has. (You have to call `ollama create` then start a new `ollama run` session after each change to the Modelfile, or else the changes won't apply.) Then it will use both the CPU and the GPU. Unfortunately, offloading only a small number of layers of any model doesn't seem to give much more speed than just using the CPU, but you can try it out and see how well it works for you.

@madsamjp commented on GitHub (Dec 23, 2023):

@Confuze, I've successfully managed to run this model using text-generation-webui with llama.cpp.

I offload 20 layers to my 4090 with a context window of 8k, and I get a consistent 8-10 tokens/s each time. This slowness is definitely an Ollama issue.
![image](https://github.com/jmorganca/ollama/assets/49611363/ee131d66-c4f8-48af-9ea1-c90799c7e863)
![image](https://github.com/jmorganca/ollama/assets/49611363/fbefe225-8ca5-46cf-a951-f11bf0fca5a2)
![image](https://github.com/jmorganca/ollama/assets/49611363/371947aa-aded-4509-a042-ce8e3961a0da)

@coder543 commented on GitHub (Dec 23, 2023):

@madsamjp With a 4090, you should be able to offload all 33 layers of the 3-bit quantized models and get 50+ tokens per second. If you want to run the 5-bit model, it will be slow because CPU inference of any LLM is dependent on the memory bandwidth, and outside of Apple Silicon, CPUs do not have very much memory bandwidth compared to GPUs.

I’m not connected to the ollama project, but I don’t see how this is ollama’s fault in the slightest.

Unless you’re talking about the prompt eval time issue, which was already discussed at length and is clearly a choice ollama has made not to cache the eval state between prompts. In which case, I don’t see anything new in your comment. @Confuze did not seem to be talking about the prompt eval issue at all. They were encountering slowness on the very first prompt, not subsequent prompts where the context was growing.

@madsamjp commented on GitHub (Dec 23, 2023):

@coder543 I understand that running the 5-bit model will be slow on a 4090 compared to running the 3-bit one. My comment was specifically in response to this point that @Confuze made: "So, the only option I have is running this model on a CPU?". I've found that running this model using llama.cpp (with ooba), partially offloaded to the GPU, works fine, whereas with Ollama it doesn't work without very long (and progressively worse) prompt eval times. Using Ollama, after 4 prompts, I'm waiting about 1 minute before I start to get a response. The response timing itself is _not_ slow for me: about 10 tokens/s.

My understanding of this thread was that Ollama seems to have progressively longer prompt eval times, even for models that fit entirely in VRAM. If this is because of a conscious decision that the Ollama team has made, then it makes running Mixtral with Ollama unfeasible.

It seems that perhaps we are discussing separate issues in the same thread which is leading to confusion.

@iTestAndroid commented on GitHub (Dec 29, 2023):

This is my nvidia-smi output

```
Fri Dec 29 03:00:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000000:01:00.0 Off |                    0 |
| N/A   57C    P0              45W / 250W |   5786MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:04:00.0 Off |                  Off |
| N/A   70C    P0              32W /  70W |   5181MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000                Off | 00000000:05:00.0 Off |                  Off |
| 33%   48C    P2              62W / 260W |  13581MiB / 49152MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       Off | 00000000:08:00.0 Off |                    0 |
| N/A   66C    P0              30W /  70W |   5171MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE-16GB           Off | 00000000:86:00.0 Off |                    0 |
| N/A   55C    P0              46W / 250W |   5350MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE-16GB           Off | 00000000:89:00.0 Off |                    0 |
| N/A   51C    P0              41W / 250W |   5350MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE-16GB           Off | 00000000:8A:00.0 Off |                    0 |
| N/A   53C    P0              40W / 250W |   5350MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5782MiB |
|    1   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5176MiB |
|    2   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner    13578MiB |
|    3   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5166MiB |
|    4   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5346MiB |
|    5   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5346MiB |
|    6   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5346MiB |
+---------------------------------------------------------------------------------------+

```

It's responding very slowly for me. Some prompts take 15 seconds. Any suggestions? I did the Modelfile trick with num_gpus set to 1000, but it's still using 93, as I can see when I run `ps -aux`.

My server has 768 GB of RAM and 2x Xeon CPUs; still, it's disappointingly slow at running mixtral.

@coder543 commented on GitHub (Jan 6, 2024):

FWIW, I saw that the release notes for ollama 0.1.18 mentioned "Improved performance when sending follow up messages in ollama run or via the API."

I just tested it, and it appears that ollama is now caching the eval state between prompts.

The first prompt:

```
total duration:       6.250217789s
load duration:        262.05µs
prompt eval count:    16 token(s)
prompt eval duration: 253.355ms
prompt eval rate:     63.15 tokens/s
eval count:           299 token(s)
eval duration:        5.992618s
eval rate:            49.89 tokens/s
```

The second prompt:

```
total duration:       10.447833337s
load duration:        240.24µs
prompt eval count:    15 token(s)
prompt eval duration: 241.787ms
prompt eval rate:     62.04 tokens/s
eval count:           495 token(s)
eval duration:        10.203837s
eval rate:            48.51 tokens/s
```

And, for good measure, sending a third prompt to the same chat:

```
total duration:       535.952887ms
load duration:        433.381µs
prompt eval count:    18 token(s)
prompt eval duration: 303.442ms
prompt eval rate:     59.32 tokens/s
eval count:           12 token(s)
eval duration:        229.294ms
eval rate:            52.33 tokens/s
```

In the second and third prompts, it still evaluated very few tokens for the prompt. In previous versions, it would evaluate the entire context window again with each message.

So, one of the two problems being discussed here appears to be resolved. The other issue (prompt eval rate being low for Mixtral) is still relatively unsolved.

@djmaze commented on GitHub (Jan 13, 2024):

As the latest Ollama versions are crashing or not even starting for me, I was looking for an alternative solution, at least until the current problems are solved. For people experiencing similar problems, I can warmly recommend using ExllamaV2 or, more concretely, [TabbyAPI](https://github.com/theroyallab/tabbyAPI/). It uses ExllamaV2 as its backend.

[A 3.5 bpw quant of mixtral](https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2) with 4k context easily fits into a 24 GB card, even leaving a few GB for other stuff. With a 3090, I am seeing consistent eval rates of 70 tokens/s, which is much more than I was able to achieve with Ollama / llama.cpp.

It might even be interesting to add ExllamaV2 as a backend for Ollama?

@Bearsaerker commented on GitHub (Jan 22, 2024):

This problem does not only exist with the 8x7b Mixtral version. All MoEs I tested had the initial big delay, while other models were instant. I used the fusion 2x7b q4km and the solar q5km. The Solar output was instant, while the fusion 2x7b's delay gradually increased as the context grew.

@Bearsaerker commented on GitHub (Jan 23, 2024):

With the newest pre-release of ollama, 0.1.21, it seems fixed. I'm sure it had something to do with llama.cpp, which was updated in this release.

@pdevine commented on GitHub (Mar 12, 2024):

This should be fixed now. With a 4090, I see:

```
total duration:       2.664210661s
load duration:        438.566µs
prompt eval duration: 54.54ms
prompt eval rate:     0.00 tokens/s
eval count:           69 token(s)
eval duration:        2.608517s
eval rate:            26.45 tokens/s
```

Then:

```
total duration:       17.580922486s
load duration:        671.919µs
prompt eval count:    13 token(s)
prompt eval duration: 274.06ms
prompt eval rate:     47.43 tokens/s
eval count:           440 token(s)
eval duration:        17.303543s
eval rate:            25.43 tokens/s
```

And for the 3rd prompt:

```
total duration:       23.699967097s
load duration:        826.491µs
prompt eval count:    15 token(s)
prompt eval duration: 318.672ms
prompt eval rate:     47.07 tokens/s
eval count:           564 token(s)
eval duration:        23.372658s
eval rate:            24.13 tokens/s
```

Going to go ahead and close the issue.

@grafke commented on GitHub (Mar 13, 2024):

I just pulled the latest Docker image `ollama/ollama:0.1.29` and I'm still experiencing very long prompt eval times (with large prompts). @pdevine, do you know if the fix for it is in the image? Or shall I build ollama from the latest main?

Here are my results, the first prompt:
```
{"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =    7733.01 ms /  2163 tokens (    3.58 ms per token,   279.71 tokens per second)","n_prompt_tokens_processed":2163,"n_tokens_second":279.7100791451502,"slot_id":0,"t_prompt_processing":7733.007,"t_token":3.575130374479889,"task_id":312,"tid":"139930931721792","timestamp":1710342290}
{"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =    2852.03 ms /   161 runs   (   17.71 ms per token,    56.45 tokens per second)","n_decoded":161,"n_tokens_second":56.45101909867707,"slot_id":0,"t_token":17.71447204968944,"t_token_generation":2852.03,"task_id":312,"tid":"139930931721792","timestamp":1710342290}
```
the second prompt
```
{"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =   13081.74 ms /  3813 tokens (    3.43 ms per token,   291.47 tokens per second)","n_prompt_tokens_processed":3813,"n_tokens_second":291.47492042918134,"slot_id":0,"t_prompt_processing":13081.743,"t_token":3.430826907946499,"task_id":166,"tid":"139930931721792","timestamp":1710342267}
{"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =    2617.80 ms /   143 runs   (   18.31 ms per token,    54.63 tokens per second)","n_decoded":143,"n_tokens_second":54.62606358473802,"slot_id":0,"t_token":18.30627972027972,"t_token_generation":2617.798,"task_id":166,"tid":"139930931721792","timestamp":1710342267}
```
and the third prompt:
```
{"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =   13079.91 ms /  3813 tokens (    3.43 ms per token,   291.52 tokens per second)","n_prompt_tokens_processed":3813,"n_tokens_second":291.51570044846625,"slot_id":0,"t_prompt_processing":13079.913,"t_token":3.430346970889064,"task_id":476,"tid":"139930931721792","timestamp":1710342414}
{"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =    2602.60 ms /   143 runs   (   18.20 ms per token,    54.95 tokens per second)","n_decoded":143,"n_tokens_second":54.94511827993346,"slot_id":0,"t_token":18.199979020979022,"t_token_generation":2602.597,"task_id":476,"tid":"139930931721792","timestamp":1710342414}
```
Generation is fast but the prompt eval time is suuuuper slow.
I'm using the option `"num_ctx": 32768` and running this model: https://ollama.com/grf/mixtral_wa_q4_cp (it's a quantized mixtral with an adapter) on an A100-40GB.

@pdevine commented on GitHub (Mar 13, 2024):

@grafke back-of-the-napkin math for mixtral at 4-bit quantization says it needs about 30ish GB, but I'm not 100% sure how the context length impacts the total amount of memory required (i.e. whether you're swapping) or if it's just that the long context requires that much more computation power.

My understanding is that the memory/computational resources scale quadratically as you increase the context size, so you're going to need quite a bit more memory than the 40GB. FWIW I pulled your model on my M3 128GB machine and got:

```
prompt eval count:    829 token(s)
prompt eval duration: 4.738038s
prompt eval rate:     174.97 tokens/s
```

I think that's roughly tracking the speeds you're seeing?
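
(For reference, and not something measured in this thread: the usual back-of-the-envelope transformer scaling is that the KV cache grows linearly with the context length, while the attention cost of processing a full prompt grows roughly quadratically.)

$$
\text{KV cache bytes} \approx 2 \cdot n_\text{layers} \cdot n_\text{kv heads} \cdot d_\text{head} \cdot n_\text{ctx} \cdot b_\text{elem},
\qquad
\text{prompt attention cost} \in O\!\left(n_\text{ctx}^{2}\right)
$$

So for a Mixtral-sized model, a 32k context adds a few GB of KV cache on top of the weights, and most of the extra prompt latency is compute rather than memory.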

@grafke commented on GitHub (Mar 14, 2024):

@pdevine Thanks for taking a look into it! I will try to get an A100 80GB to see if this could be resolved by increasing the memory. Indeed you're right, I'm seeing similar results.

Testing with the (slow) transformers library, I got a TTFT (time to first token) of ~1.5 seconds (prompts are between 2k and 3k tokens long), whilst with ollama the TTFT is ~8-12 seconds for the same prompts (however, the total response time is 2-3x faster with ollama).
So that got me wondering whether I'm missing something that lets the "slow" library achieve a shorter TTFT.
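
(A sketch of how the two can be compared apples-to-apples, added for reference: stream from `/api/generate` and timestamp the first chunk that carries text. The prompt below is a synthetic placeholder of a few thousand tokens.)

```python
# Sketch: measure time-to-first-token (TTFT) via ollama's streaming API.
import json
import time

import requests

LONG_PROMPT = "Summarize this text. " + "The quick brown fox jumps over the lazy dog. " * 500

start = time.monotonic()
first_token_at = None

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "grf/mixtral_wa_q4_cp", "prompt": LONG_PROMPT, "stream": True},
    stream=True,
    timeout=600,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if first_token_at is None and chunk.get("response"):
            first_token_at = time.monotonic()
        if chunk.get("done"):
            break

print(f"TTFT:  {first_token_at - start:.2f}s")
print(f"total: {time.monotonic() - start:.2f}s")
```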

@pdevine commented on GitHub (Mar 14, 2024):

Interesting. Are you including model loading time in the TTFT? On my system that's about 2 seconds, although I'm not including model load time.

@pdevine commented on GitHub (Mar 14, 2024):

@grafke just thinking about that some more, you can make a call like:

```
curl http://localhost:11434/api/generate -d '{"model": "grf/mixtral_wa_q4_cp", "prompt": ""}'
```

Which will preload the model in memory so that when you make the next call it should be faster.

@grafke commented on GitHub (Mar 18, 2024):

@pdevine The model is indeed preloaded in memory after I make the first request (and that request is slow). The requests I posted above are subsequent requests, made when the model is already in memory. I guess this is due to the very long prompts (2k-3k tokens); shorter prompts indeed have a much shorter TTFT.


Reference: github-starred/ollama#47362