[GH-ISSUE #1556] Delays and slowness when using mixtral #47362

Closed
opened 2026-04-28 03:37:13 -05:00 by GiteaMirror · 41 comments

Originally created by @djmaze on GitHub (Dec 15, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1556

It seems that as the context grows, the delay until the first output gets longer and longer, taking more than half a minute after a few prompts. Also, text generation seems much slower than with the latest llama.cpp (command line).

Using CUDA on an RTX 3090. I tried `mixtral:8x7b-instruct-v0.1-q4_K_M` (with CPU offloading) as well as `mixtral:8x7b-instruct-v0.1-q2_K` (completely in VRAM).

For comparison, I tried `starling-lm:7b-alpha-q4_K_M`, which does not seem to exhibit any of these problems.

Sorry for the imprecise report; I'm running out of time right now. Does anyone have a similar experience with Mixtral? Or is this expected behaviour with ollama? (First-time user here.)
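
(For reference, not part of the original report: the per-request timings that `ollama run <model> --verbose` prints are also returned by the HTTP API, which makes slowdowns like this easy to quantify. A minimal sketch, assuming a local server on the default port and a model that is already pulled; the duration fields come back in nanoseconds.)

```python
# Sketch: read ollama's server-reported timings for one request.
# Assumes ollama is listening on localhost:11434 and the model has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q4_K_M",
        "prompt": "Summarize the plot of Hamlet in three sentences.",
        "stream": False,
    },
    timeout=600,
).json()

ns = 1e9  # durations are returned in nanoseconds
prompt_tokens = resp.get("prompt_eval_count", 0)
prompt_secs = resp.get("prompt_eval_duration", 0) / ns
gen_tokens = resp["eval_count"]
gen_secs = resp["eval_duration"] / ns

print(f"prompt eval: {prompt_tokens} tokens in {prompt_secs:.2f}s")
print(f"generation:  {gen_tokens} tokens in {gen_secs:.2f}s ({gen_tokens / gen_secs:.1f} tokens/s)")
```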

Originally created by @djmaze on GitHub (Dec 15, 2023). Original GitHub issue: https://github.com/ollama/ollama/issues/1556 It seems as the context grows, the delay until the first output is getting longer and longer, taking more than half a minute after a few prompts. Also, text generation seems much slower than with the latest llama.cpp (commandline). Using CUDA on a RTX 3090. Tried out `mixtral:8x7b-instruct-v0.1-q4_K_M` (with CPU offloading) as well as `mixtral:8x7b-instruct-v0.1-q2_K` (completely in VRAM). As a comparison, I tried `starling-lm:7b-alpha-q4_K_M`, which seems not to exhibit any of these problems. Sorry for the unprecise report, running out of time right now. Does anyone have a similar experience with Mixtral? Or is this expected behaviour with ollama? (First-time user here.)
Author
Owner

@madsamjp commented on GitHub (Dec 16, 2023):

Can confirm I'm also having this issue. I'm running `dolphin-mixtral:8x7b-v2.5-q5_K_M` with 22 layers offloaded to the GPU (RTX 4090). The first response takes 2 seconds, the second 26, the third 37, and the fourth 49. By the fourth response there are 888 tokens in the context window.

Eval rate is a respectable ~10 tokens/s, but with a prompt eval of over a minute by the fifth response, it's unusable.

@easp commented on GitHub (Dec 17, 2023):

Yeah, big issue on Apple Silicon Macs, too. I've seen references to this being a known problem for Mixtral on llama.cpp right now, but I can't find an actual issue about it on the llama.cpp GitHub.

@phalexo commented on GitHub (Dec 17, 2023):

Ollama has a history file in the ~/.ollama folder. Does ollama constantly parse that cache?

@easp commented on GitHub (Dec 17, 2023):

That's just the readline history: the commands entered in the REPL.

@easp commented on GitHub (Dec 17, 2023):

Looks like this recently merged llama.cpp PR may improve prompt-processing speed with Mixtral: https://github.com/ggerganov/llama.cpp/pull/4480

@coder543 commented on GitHub (Dec 17, 2023):

The default `mixtral` Modelfile only offloads like 22 layers, as noted previously. For people with 24GB of VRAM, I have found that the `q3_K_S` model can be completely offloaded to the GPU, which speeds things up dramatically:

Make a `Modelfile`:

```
FROM mixtral:8x7b-instruct-v0.1-q3_K_S
PARAMETER num_gpu 33
```

Then run `ollama create mixtral_gpu -f ./Modelfile`

Then you can run `ollama run mixtral_gpu` and see how it does.
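
(As an aside, not from the original comment: the same `num_gpu` override can, as far as I know, also be passed per request through the API's `options` field instead of baking it into a new model. A minimal sketch, assuming a local server on the default port:)

```python
# Sketch: per-request num_gpu override instead of a custom Modelfile.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mixtral:8x7b-instruct-v0.1-q3_K_S",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"num_gpu": 33},  # ask for all 33 layers on the GPU
    },
    timeout=600,
)
print(resp.json()["response"])
```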

@coder543 commented on GitHub (Dec 17, 2023):

I also wonder if it would be possible for ollama to keep the eval state between prompts, rather than re-processing the entire context window for each new message. I understand ollama is trying to run a model server so there could be requests coming from more than one session at a time, but maybe it's possible to only clear the state and start from scratch if a request from a different session is received? This is all a little beyond my expertise, so I could be completely wrong.
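
(Illustration only, not ollama's actual code: the idea being suggested here is prompt-prefix caching. A toy sketch of the bookkeeping, with hypothetical names:)

```python
# Toy sketch of prompt-prefix caching (hypothetical, not ollama's implementation).
# Keep the tokens whose attention (KV) state is already in memory; on the next
# request, only the unseen suffix needs to be evaluated.

class PrefixCache:
    def __init__(self) -> None:
        self.cached_tokens: list[int] = []

    def tokens_to_evaluate(self, prompt_tokens: list[int]) -> list[int]:
        # Length of the longest prefix shared with the cached conversation.
        common = 0
        for old, new in zip(self.cached_tokens, prompt_tokens):
            if old != new:
                break
            common += 1
        self.cached_tokens = list(prompt_tokens)  # remember the new full context
        return prompt_tokens[common:]             # only this suffix needs eval


cache = PrefixCache()
print(len(cache.tokens_to_evaluate([1, 2, 3, 4])))        # 4: first prompt, full eval
print(len(cache.tokens_to_evaluate([1, 2, 3, 4, 5, 6])))  # 2: follow-up, suffix only
```

In a real server the cached state is the KV cache for those tokens, and a request from a different session (or an edited earlier message) invalidates everything past the shared prefix.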

@phalexo commented on GitHub (Dec 17, 2023):

> The default `mixtral` Modelfile only offloads like 22 layers, as noted previously. For people with 24GB of VRAM, I have found that the `q3_K_S` model can be completely offloaded to the GPU, which speeds things up dramatically:
>
> Make a `Modelfile`:
>
> ```
> FROM mixtral:8x7b-instruct-v0.1-q3_K_S
> PARAMETER num_gpu 33
> ```
>
> Then run `ollama create mixtral_gpu -f ./Modelfile`
>
> Then you can run `ollama run mixtral_gpu` and see how it does.

Using llama.cpp directly in interactive mode does not appear to have any major delays. It takes merely a second or two to start answering even after a relatively long conversation.

Looks like latency is specific to ollama.

@djmaze commented on GitHub (Dec 17, 2023):

@coder543 As stated in my initial post, I even tried the q2_k version, loading all 33 layers into the GPU. Still, the token generation is quite slow and the delay before the token generation starts increases on every prompt as the context grows.

As also stated, when using llama.cpp or a totally different model, there are no delays and the token generation (for the same model) is significantly faster.

@coder543 commented on GitHub (Dec 17, 2023):

@djmaze that is strange, since I'm not encountering any unusual problems on my 3090.

```
total duration:       18.273218027s
prompt eval count:    1180 token(s)
prompt eval duration: 15.833678s
prompt eval rate:     74.52 tokens/s
eval count:           114 token(s)
eval duration:        2.391734s
eval rate:            47.66 tokens/s
```

Here, there are nearly 1200 tokens in the context window of previous chat messages, and yet it is able to generate a response in less than 20 seconds. Yes, this is slower than it could be, but that seems to relate to what I mentioned in my previous comment about it not keeping the eval state between generations.

This is not the terrible performance that other people are describing, where it is taking 50 seconds with less than 900 tokens in the context window.

EDIT: testing mistral (instead of mixtral), I am seeing this after a similar situation:

```
total duration:       2.244759039s
prompt eval count:    1211 token(s)
prompt eval duration: 421.415ms
prompt eval rate:     2873.65 tokens/s
eval count:           208 token(s)
eval duration:        1.774238s
eval rate:            117.23 tokens/s
```

The key differentiator is that the prompt eval rate is obviously way higher. As someone else linked to a PR which improved prompt eval rate on the CPU, it isn't crazy to assume that the prompt eval rate on the GPU needs some improvements as well. You say llama.cpp is much faster at this, but I haven't actually observed any real difference. Doing more testing now.

EDIT 2: yes, using llama.cpp server, it appears to be doing exactly what I mentioned: keeping the eval state in memory. It is processing prompt tokens at the same rate as ollama, it is just processing fewer of them because it does not appear to be re-evaluating the entire context window with each new prompt. The other ollama models suffer the same problems, they just seem to have a much higher prompt eval rate than mixtral, which helps to mask it.

@kaykyr commented on GitHub (Dec 17, 2023):

I can confirm the same issue here, even using both a 3090 and a 4090.

@djmaze commented on GitHub (Dec 17, 2023):

I just tried out nous-hermes:70b-llama2-q2_K in order to have a bigger model for comparison. With 51 of 81 layers offloaded to GPU, the token generation is quite slow, as expected. But I do not experience the initial delay, even when the context grows.

I also tried dolphin-mixtral:8x7b-v2.5-q4_K_M (a Mixtral finetune). It causes the same delays as I've seen with mixtral:8x7b-instruct.

From this I deduce that (at least for me) the problem is specific to the Mixtral models.

@coder543 commented on GitHub (Dec 17, 2023):

@djmaze please post the verbose output. Does it not show that the number of prompt eval tokens is growing? Presumably, it just has a much more optimized prompt eval rate, as with the mistral output I showed, but it should still have the same fundamental issue that it does not cache the eval state.

@djmaze commented on GitHub (Dec 18, 2023):

@coder543 (Sorry, I was testing with the web UI before, so I didn't have any values.) After I found out how to do it, I tested the prompt eval rate of several models with ollama now (approximate values):

| Model | Offloaded layers | Eval rate (tokens/s) | Prompt eval rate (tokens/s) |
|-------|------------------|----------------------|-----------------------------|
| starling-lm:7b-alpha-q4_K_M | 33/33 | 105 | 2700 |
| mistral:7b-instruct-v0.2-q5_K_M | 33/33 | 98 | 2200 |
| nous-hermes:70b-llama2-q2_K | 51/81 | 3 | 140 |
| mixtral:8x7b-instruct-v0.1-q2_K | 33/33 | 61 | 100 |
| mixtral:8x7b-instruct-v0.1-q4_K_M | 22/33 | 13 | 26 |

It seems interesting to me that although nous-hermes:70b-llama2-q2_K has a similar number of layers offloaded to the GPU and a much slower eval rate, it still shows a much higher prompt eval rate than mixtral:8x7b-instruct-v0.1-q4_K_M.

TL;DR: You seem to be right. The mixtral prompt eval rate, at least when only partially offloaded, looks abysmal. I wonder if that is because of the MoE architecture. Or does it also depend on the quantization?
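
(For anyone who wants to reproduce numbers like these outside the REPL, a rough sketch against the HTTP API; the model list and prompt are placeholders, and the duration fields are reported in nanoseconds.)

```python
# Sketch: compare prompt-eval and generation rates across models via the API.
import requests

MODELS = [
    "starling-lm:7b-alpha-q4_K_M",
    "mixtral:8x7b-instruct-v0.1-q2_K",
]
PROMPT = "Explain the difference between TCP and UDP in detail. " * 40  # longish prompt

for model in MODELS:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=1200,
    ).json()
    prompt_rate = r.get("prompt_eval_count", 0) / (r.get("prompt_eval_duration", 1) / 1e9)
    eval_rate = r["eval_count"] / (r["eval_duration"] / 1e9)
    print(f"{model:40s} prompt eval {prompt_rate:8.1f} tok/s   eval {eval_rate:6.1f} tok/s")
```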

@ghost commented on GitHub (Dec 19, 2023):

I installed Ollama with the `curl ... | sh` command on WSL, and I'm running `dolphin-mixtral:latest` with 64 GB of RAM and a 4080 with 16 GB of VRAM. I don't really understand anything about running this stuff, but yeah, the more I talk to the AI, the longer every reply gets delayed. Is it something that can be fixed at the software level, or something I can do on my end?

@djmaze commented on GitHub (Dec 19, 2023):

Either way, I support @coder543's wish for a prompt eval cache. There is already an issue at #1573 for that, maybe we can continue there.

@jamesbascle commented on GitHub (Dec 19, 2023):

I used

```
FROM dolphin-mixtral
PARAMETER num_gpu 33
```

to try getting as much of it onto my 3090 as possible and got a bit of a speedup but it is still pretty slow and only gets slower as the conversation goes on.

@phalexo commented on GitHub (Dec 20, 2023):

I was wondering if there is any indication that someone is looking into this? Also, I am wondering what effect the `LLAMA_CUDA_FORCE_MMQ=on` setting has on performance. If the optimized cuBLAS kernels are not used, then what is the performance penalty when using the MMQ kernels instead?

And why was ollama 0.1.11 and earlier working? Presumably it was using cuBLAS. What changed from 0.1.11 to 0.1.12 to make it stop working?

@gnusenpai commented on GitHub (Dec 21, 2023):

Building ollama with https://github.com/ggerganov/llama.cpp/pull/4538 and (optionally, if you do CPU+GPU inference) https://github.com/ggerganov/llama.cpp/pull/4553 has made prompt eval significantly faster for me. (~60t/s vs. ~10t/s)

@coder543 commented on GitHub (Dec 21, 2023):

For me, using llama.cpp directly, that PR appears to have raised prompt eval rate to about 325t/s:

```
print_timings: prompt eval time = 3444.74 ms / 1122 tokens ( 3.07 ms per token, 325.71 tokens per second)
print_timings: eval time = 5166.55 ms / 205 runs ( 25.20 ms per token, 39.68 tokens per second)
print_timings: total time = 8611.28 ms
```

Still not as fast as other models, but a significant improvement.

@djmaze commented on GitHub (Dec 21, 2023):

But beware, it seems the quality might have dropped as a side effect: https://github.com/ggerganov/llama.cpp/issues/4572

@Confuze commented on GitHub (Dec 22, 2023):

I think I'm running into the same issue on `v0.0.17` (installed from the `ollama-cuda` package on Arch Linux).

When running dolphin-mixtral with `num_gpu` set to 10000 (just to be sure), it's practically unusable: it takes the model about a minute to start responding to a single prompt, and it generates the answer painfully slowly. (I tried it without the `num_gpu` parameter as well; no difference.) According to `nvidia-smi`, ollama isn't using the GPU (RTX 2070) at all.

This appears to be a problem related only to mixtral, as running others like llama2 results in perfect performance with my GPU being fully used.

By reading through this issue I understand there's not much we can do on a user's level, right? Apologies if this comment makes no sense, I know nothing about this thing just wanted to generate the recipe for meth.

@coder543 commented on GitHub (Dec 22, 2023):

@Confuze you don’t have enough VRAM to run Mixtral entirely on the GPU. ollama will try to load the model onto the GPU, run out of memory, and then fall back to just running on the CPU.

@phalexo commented on GitHub (Dec 22, 2023):

I don't think we should confuse two separate problems.

Sometimes there is really not enough VRAM.

Sometimes you run into the cuBLAS 15 error, which was introduced starting with v0.1.12 and which also often looks like an OOM.

v0.1.11 didn't have this issue.

The only way to mitigate it that I am aware of at the moment is to build with `LLAMA_CUDA_FORCE_MMQ=on`, but this solution, as far as I know, is slower than cuBLAS.

It really should be fixed.

@Confuze commented on GitHub (Dec 22, 2023):

> @Confuze you don’t have enough VRAM to run Mixtral entirely on the GPU. ollama will try to load the model onto the GPU, run out of memory, and then fall back to just running on the CPU.

I see; after looking at the logs, it seems like you are right.

```
CUDA error 2 at /build/ollama-cuda/src/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:9148: out of memory
```

So, the only option I have is running this model on a CPU (besides getting a better GPU, of course)? There's no way to load it partially on the GPU and partially on the CPU?

@coder543 commented on GitHub (Dec 22, 2023):

@Confuze the `num_gpu` parameter that you set to 10000 was trying to force _more_ layers onto the GPU. Mixtral has 33 layers. You just have to keep lowering that number until the VRAM usage is low enough. I would be surprised if you can fit more than 10 layers on the 8GB of VRAM that I think your GPU has. (You have to call `ollama create` then start a new `ollama run` session after each change to the Modelfile, or else the changes won't apply.) Then it will use both the CPU and the GPU. Unfortunately, offloading only a small number of layers of any model doesn't seem to give much more speed than just using the CPU, but you can try it out and see how well it works for you.

@madsamjp commented on GitHub (Dec 23, 2023):

@Confuze, I've successfully managed to run this model using text-generation-webui with llama.cpp.

I offload 20 layers to my 4090 with a context window of 8k, and I get a consistent 8-10 tokens/s each time. This slowness is definitely an Ollama issue.
![image](https://github.com/jmorganca/ollama/assets/49611363/ee131d66-c4f8-48af-9ea1-c90799c7e863)
![image](https://github.com/jmorganca/ollama/assets/49611363/fbefe225-8ca5-46cf-a951-f11bf0fca5a2)
![image](https://github.com/jmorganca/ollama/assets/49611363/371947aa-aded-4509-a042-ce8e3961a0da)

@coder543 commented on GitHub (Dec 23, 2023):

@madsamjp With a 4090, you should be able to offload all 33 layers of the 3-bit quantized models and get 50+ tokens per second. If you want to run the 5-bit model, it will be slow because CPU inference of any LLM is dependent on the memory bandwidth, and outside of Apple Silicon, CPUs do not have very much memory bandwidth compared to GPUs.

I’m not connected to the ollama project, but I don’t see how this is ollama’s fault in the slightest.

Unless you’re talking about the prompt eval time issue, which was already discussed at length and is clearly a choice ollama has made not to cache the eval state between prompts. In which case, I don’t see anything new in your comment. @Confuze did not seem to be talking about the prompt eval issue at all. They were encountering slowness on the very first prompt, not subsequent prompts where the context was growing.

@madsamjp commented on GitHub (Dec 23, 2023):

@coder543 I understand that running the 5-bit model will be slow on a 4090 compared to running the 3-bit one. My comment was specifically in response to this point that @Confuze made: "So, the only option I have is running this model on a CPU?". I've found that running this model using llama.cpp (with ooba), partially offloaded to the GPU, works fine, whereas with Ollama it doesn't work without very long (and progressively worse) prompt eval times. Using Ollama, after 4 prompts, I'm waiting about 1 minute before I start to get a response. The response timing itself is _not_ slow for me: about 10 tokens/s.

My understanding of this thread was that Ollama seems to have progressively longer prompt eval times, even for models that fit entirely in VRAM. If this is because of a conscious decision that the Ollama team has made, then it makes running Mixtral with Ollama unfeasible.

It seems that perhaps we are discussing separate issues in the same thread which is leading to confusion.

@iTestAndroid commented on GitHub (Dec 29, 2023):

This is my nvidia-smi output

```
Fri Dec 29 03:00:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000000:01:00.0 Off |                    0 |
| N/A   57C    P0              45W / 250W |   5786MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:04:00.0 Off |                  Off |
| N/A   70C    P0              32W /  70W |   5181MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000                Off | 00000000:05:00.0 Off |                  Off |
| 33%   48C    P2              62W / 260W |  13581MiB / 49152MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                       Off | 00000000:08:00.0 Off |                    0 |
| N/A   66C    P0              30W /  70W |   5171MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE-16GB           Off | 00000000:86:00.0 Off |                    0 |
| N/A   55C    P0              46W / 250W |   5350MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE-16GB           Off | 00000000:89:00.0 Off |                    0 |
| N/A   51C    P0              41W / 250W |   5350MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE-16GB           Off | 00000000:8A:00.0 Off |                    0 |
| N/A   53C    P0              40W / 250W |   5350MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5782MiB |
|    1   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5176MiB |
|    2   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner    13578MiB |
|    3   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5166MiB |
|    4   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5346MiB |
|    5   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5346MiB |
|    6   N/A  N/A     32955      C   ...p/gguf/build/cuda/bin/ollama-runner     5346MiB |
+---------------------------------------------------------------------------------------+

```

It's responding very slowly for me. Some prompts take 15 seconds. Any suggestions? I did the Modelfile trick with num_gpus set to 1000, but it's still using 93, as I can see when I run `ps -aux`.

My server has 768 GB of RAM and 2x Xeon CPUs; still, it's disappointingly slow at running mixtral.

@coder543 commented on GitHub (Jan 6, 2024):

FWIW, I saw that the release notes for ollama 0.1.18 mentioned "Improved performance when sending follow up messages in ollama run or via the API."

I just tested it, and it appears that ollama is now caching the eval state between prompts.

The first prompt:

```
total duration:       6.250217789s
load duration:        262.05µs
prompt eval count:    16 token(s)
prompt eval duration: 253.355ms
prompt eval rate:     63.15 tokens/s
eval count:           299 token(s)
eval duration:        5.992618s
eval rate:            49.89 tokens/s
```

The second prompt:

```
total duration:       10.447833337s
load duration:        240.24µs
prompt eval count:    15 token(s)
prompt eval duration: 241.787ms
prompt eval rate:     62.04 tokens/s
eval count:           495 token(s)
eval duration:        10.203837s
eval rate:            48.51 tokens/s
```

And, for good measure, sending a third prompt to the same chat:

```
total duration:       535.952887ms
load duration:        433.381µs
prompt eval count:    18 token(s)
prompt eval duration: 303.442ms
prompt eval rate:     59.32 tokens/s
eval count:           12 token(s)
eval duration:        229.294ms
eval rate:            52.33 tokens/s
```

In the second and third prompts, it still evaluated very few tokens for the prompt. In previous versions, it would evaluate the entire context window again with each message.

So, one of the two problems being discussed here appears to be resolved. The other issue (prompt eval rate being low for Mixtral) is still relatively unsolved.

@djmaze commented on GitHub (Jan 13, 2024):

As the latest Ollama versions are crashing or not even starting for me, I was looking for an alternative solution, at least until the current problems are solved. For people experiencing similar problems, I can warmly recommend using ExllamaV2 or, more concretely, [TabbyAPI](https://github.com/theroyallab/tabbyAPI/). It uses ExllamaV2 as its backend.

[A 3.5 bpw quant of mixtral](https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2) with 4k context easily fits into a 24 GB card, even leaving a few GB for other stuff. With a 3090, I am seeing consistent eval rates of 70 tokens/s, which is much more than I was able to achieve with Ollama / llama.cpp.

It might even be interesting to add ExllamaV2 as a backend for Ollama?

@Bearsaerker commented on GitHub (Jan 22, 2024):

This problem does not only exist with the 8x7b Mixtral version. All MoEs I tested had the initial big delay, while other models were instant. I used the fusion 2x7b q4km and the solar q5km. The Solar output was instant, while the fusion 2x7b's delay gradually increased as the context grew.

@Bearsaerker commented on GitHub (Jan 23, 2024):

With the newest pre-release of ollama, 0.1.21, it seems fixed. I'm sure it had something to do with llama.cpp, which was updated in this release.

@pdevine commented on GitHub (Mar 12, 2024):

This should be fixed now. With a 4090, I see:

```
total duration:       2.664210661s
load duration:        438.566µs
prompt eval duration: 54.54ms
prompt eval rate:     0.00 tokens/s
eval count:           69 token(s)
eval duration:        2.608517s
eval rate:            26.45 tokens/s
```

Then:

```
total duration:       17.580922486s
load duration:        671.919µs
prompt eval count:    13 token(s)
prompt eval duration: 274.06ms
prompt eval rate:     47.43 tokens/s
eval count:           440 token(s)
eval duration:        17.303543s
eval rate:            25.43 tokens/s
```

And for the 3rd prompt:

```
total duration:       23.699967097s
load duration:        826.491µs
prompt eval count:    15 token(s)
prompt eval duration: 318.672ms
prompt eval rate:     47.07 tokens/s
eval count:           564 token(s)
eval duration:        23.372658s
eval rate:            24.13 tokens/s
```

Going to go ahead and close the issue.

@grafke commented on GitHub (Mar 13, 2024):

I just pulled the latest Docker image `ollama/ollama:0.1.29` and I'm still experiencing very long prompt eval times (with large prompts). @pdevine, do you know if the fix for it is in the image? Or shall I build ollama from the latest main?

Here are my results, the first prompt:
```
{"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =    7733.01 ms /  2163 tokens (    3.58 ms per token,   279.71 tokens per second)","n_prompt_tokens_processed":2163,"n_tokens_second":279.7100791451502,"slot_id":0,"t_prompt_processing":7733.007,"t_token":3.575130374479889,"task_id":312,"tid":"139930931721792","timestamp":1710342290}
{"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =    2852.03 ms /   161 runs   (   17.71 ms per token,    56.45 tokens per second)","n_decoded":161,"n_tokens_second":56.45101909867707,"slot_id":0,"t_token":17.71447204968944,"t_token_generation":2852.03,"task_id":312,"tid":"139930931721792","timestamp":1710342290}
```
the second prompt
```
{"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =   13081.74 ms /  3813 tokens (    3.43 ms per token,   291.47 tokens per second)","n_prompt_tokens_processed":3813,"n_tokens_second":291.47492042918134,"slot_id":0,"t_prompt_processing":13081.743,"t_token":3.430826907946499,"task_id":166,"tid":"139930931721792","timestamp":1710342267}
{"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =    2617.80 ms /   143 runs   (   18.31 ms per token,    54.63 tokens per second)","n_decoded":143,"n_tokens_second":54.62606358473802,"slot_id":0,"t_token":18.30627972027972,"t_token_generation":2617.798,"task_id":166,"tid":"139930931721792","timestamp":1710342267}
```
and the third prompt:
```
{"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =   13079.91 ms /  3813 tokens (    3.43 ms per token,   291.52 tokens per second)","n_prompt_tokens_processed":3813,"n_tokens_second":291.51570044846625,"slot_id":0,"t_prompt_processing":13079.913,"t_token":3.430346970889064,"task_id":476,"tid":"139930931721792","timestamp":1710342414}
{"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =    2602.60 ms /   143 runs   (   18.20 ms per token,    54.95 tokens per second)","n_decoded":143,"n_tokens_second":54.94511827993346,"slot_id":0,"t_token":18.199979020979022,"t_token_generation":2602.597,"task_id":476,"tid":"139930931721792","timestamp":1710342414}
```
Generation is fast but the prompt eval time is suuuuper slow.
I'm using the option `"num_ctx": 32768` and running this model: https://ollama.com/grf/mixtral_wa_q4_cp (it's a quantized mixtral with an adapter) on an A100-40GB.

@pdevine commented on GitHub (Mar 13, 2024):

@grafke back-of-the-napkin math for mixtral at 4-bit quantization says it needs about 30ish GB, but I'm not 100% sure how the context length impacts the total amount of memory required (i.e. whether you're swapping) or if it's just that the long context requires that much more computation power.

My understanding is that the memory/computational resources scale quadratically as you increase the context size, so you're going to need quite a bit more memory than the 40GB. FWIW I pulled your model on my M3 128GB machine and got:

```
prompt eval count:    829 token(s)
prompt eval duration: 4.738038s
prompt eval rate:     174.97 tokens/s
```

I think that's roughly tracking the speeds you're seeing?
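
(For reference, and not something measured in this thread: the usual back-of-the-envelope transformer scaling is that the KV cache grows linearly with the context length, while the attention cost of processing a full prompt grows roughly quadratically.)

$$
\text{KV cache bytes} \approx 2 \cdot n_\text{layers} \cdot n_\text{kv heads} \cdot d_\text{head} \cdot n_\text{ctx} \cdot b_\text{elem},
\qquad
\text{prompt attention cost} \in O\!\left(n_\text{ctx}^{2}\right)
$$

So for a Mixtral-sized model, a 32k context adds a few GB of KV cache on top of the weights, and most of the extra prompt latency is compute rather than memory.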

@grafke commented on GitHub (Mar 14, 2024):

@pdevine Thanks for taking a look into it! I will try to get an A100 80GB to see if this could be resolved by increasing the memory. Indeed you're right, I'm seeing similar results.

Testing with the (slow) transformers library, I got a TTFT (time to first token) of ~1.5 seconds (prompts are between 2k and 3k tokens long), whilst with ollama the TTFT is ~8-12 seconds for the same prompts (however, the total response time is 2-3x faster with ollama).
So that got me wondering whether I'm missing something that lets the "slow" library achieve a shorter TTFT.
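
(A sketch of how the two can be compared apples-to-apples, added for reference: stream from `/api/generate` and timestamp the first chunk that carries text. The prompt below is a synthetic placeholder of a few thousand tokens.)

```python
# Sketch: measure time-to-first-token (TTFT) via ollama's streaming API.
import json
import time

import requests

LONG_PROMPT = "Summarize this text. " + "The quick brown fox jumps over the lazy dog. " * 500

start = time.monotonic()
first_token_at = None

with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "grf/mixtral_wa_q4_cp", "prompt": LONG_PROMPT, "stream": True},
    stream=True,
    timeout=600,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        if first_token_at is None and chunk.get("response"):
            first_token_at = time.monotonic()
        if chunk.get("done"):
            break

print(f"TTFT:  {first_token_at - start:.2f}s")
print(f"total: {time.monotonic() - start:.2f}s")
```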

@pdevine commented on GitHub (Mar 14, 2024):

Interesting. Are you including model loading time in the TTFT? On my system that's about 2 seconds, although I'm not including model load time.

@pdevine commented on GitHub (Mar 14, 2024):

@grafke just thinking about that some more, you can make a call like:

```
curl http://localhost:11434/api/generate -d '{"model": "grf/mixtral_wa_q4_cp", "prompt": ""}'
```

Which will preload the model in memory so that when you make the next call it should be faster.

@grafke commented on GitHub (Mar 18, 2024):

@pdevine The model is indeed preloaded in memory after I make the first request (and that request is slow). The requests I posted above are subsequent requests, made when the model is already in memory. I guess this is due to the very long prompts (2k-3k tokens); shorter prompts indeed have a much shorter TTFT.


Reference: github-starred/ollama#47362