Mirror of https://github.com/ollama/ollama.git (synced 2026-05-06 08:02:14 -05:00)
Closed · opened 2026-04-22 02:57:58 -05:00 by GiteaMirror · 41 comments
Originally created by @djmaze on GitHub (Dec 15, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1556
It seems that as the context grows, the delay until the first output gets longer and longer, taking more than half a minute after a few prompts. Also, text generation seems much slower than with the latest llama.cpp (command line).
Using CUDA on an RTX 3090. Tried out mixtral:8x7b-instruct-v0.1-q4_K_M (with CPU offloading) as well as mixtral:8x7b-instruct-v0.1-q2_K (completely in VRAM).
As a comparison, I tried starling-lm:7b-alpha-q4_K_M, which does not seem to exhibit any of these problems.
Sorry for the imprecise report, running out of time right now. Does anyone have a similar experience with Mixtral? Or is this expected behaviour with ollama? (First-time user here.)
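For anyone reproducing the timings discussed below: ollama prints per-response statistics (load duration, prompt eval rate, eval rate) when run with the --verbose flag. An illustrative invocation for the model above (standard ollama CLI usage, not something specified in this report) would be:
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M --verbose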
@madsamjp commented on GitHub (Dec 16, 2023):
Can confirm I'm also having this issue. I'm running dolphin-mixtral:8x7b-v2.5-q5_K_M with 22 layers offloaded to GPU (RTX 4090). First response takes 2 secs, second response 26 secs, 3rd 37 secs and 4th 49 secs. By the 4th response there are 888 tokens in the context window.
Eval rate is a respectable ~10tps, but with a > 1 minute prompt eval by the 5th response, it's unusable.
@easp commented on GitHub (Dec 17, 2023):
Yeah, big issue on Apple Silicon Macs, too. I've seen references to this being a known problem for mixtral on llama.cpp right now, but I can't find an actual issue about it on the llama.cpp GitHub.
@phalexo commented on GitHub (Dec 17, 2023):
Ollama has a history file in the ~/.ollama folder. Does ollama constantly parse that cache?
@easp commented on GitHub (Dec 17, 2023):
That's just the readline history. It's just commands entered in the REPL.
@easp commented on GitHub (Dec 17, 2023):
Looks like this recently merged llama.cpp PR may improve prompt-processing speed with Mixtral: https://github.com/ggerganov/llama.cpp/pull/4480
@coder543 commented on GitHub (Dec 17, 2023):
The default mixtral Modelfile only offloads like 22 layers, as noted previously. For people with 24GB of VRAM, I have found that the q3_K_S model can be completely offloaded to the GPU, which speeds things up dramatically.
Make a Modelfile, then run: ollama create mixtral_gpu -f ./Modelfile
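A rough sketch of such a Modelfile (the model tag and layer count below are illustrative assumptions, not quoted from the original comment):
FROM mixtral:8x7b-instruct-v0.1-q3_K_S
PARAMETER num_gpu 33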
Then you can run ollama run mixtral_gpu and see how it does.
@coder543 commented on GitHub (Dec 17, 2023):
I also wonder if it would be possible for ollama to keep the eval state between prompts, rather than re-processing the entire context window for each new message. I understand ollama is trying to run a model server so there could be requests coming from more than one session at a time, but maybe it's possible to only clear the state and start from scratch if a request from a different session is received? This is all a little beyond my expertise, so I could be completely wrong.
@phalexo commented on GitHub (Dec 17, 2023):
Using llama.cpp directly in interactive mode does not appear to have any major delays. It takes merely a second or two to start answering even after a relatively long conversation.
Looks like latency is specific to ollama.
@djmaze commented on GitHub (Dec 17, 2023):
@coder543 As stated in my initial post, I even tried the q2_K version, loading all 33 layers into the GPU. Still, the token generation is quite slow and the delay before the token generation starts increases on every prompt as the context grows.
As also stated, when using llama.cpp or a totally different model, there are no delays and the token generation (for the same model) is significantly faster.
@coder543 commented on GitHub (Dec 17, 2023):
@djmaze that is strange, since I'm not encountering any unusual problems on my 3090.
Here, there are nearly 1200 tokens in the context window of previous chat messages, and yet it is able to generate a response in less than 20 seconds. Yes, this is slower than it could be, but that seems to relate to what I mentioned in my previous comment about it not keeping the eval state between generations.
This is not the terrible performance that other people are describing, where it is taking 50 seconds with less than 900 tokens in the context window.
EDIT: testing mistral (instead of mixtral), I am seeing this after a similar situation:
The key differentiator is that the prompt eval rate is obviously way higher. As someone else linked to a PR which improved prompt eval rate on the CPU, it isn't crazy to assume that the prompt eval rate on the GPU needs some improvements as well. You say llama.cpp is much faster at this, but I haven't actually observed any real difference. Doing more testing now.
EDIT 2: yes, using the llama.cpp server, it appears to be doing exactly what I mentioned: keeping the eval state in memory. It is processing prompt tokens at the same rate as ollama, it is just processing fewer of them because it does not appear to be re-evaluating the entire context window with each new prompt. The other ollama models suffer the same problems, they just seem to have a much higher prompt eval rate than mixtral, which helps to mask it.
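For reference (this is general llama.cpp server usage, not a detail given in the thread): the server reuses the already-evaluated prompt when a request asks for it via the cache_prompt flag, along the lines of:
curl http://localhost:8080/completion -d '{"prompt": "...", "n_predict": 128, "cache_prompt": true}'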
@kaykyr commented on GitHub (Dec 17, 2023):
I can confirm the same issue here, even using both a 3090 and a 4090.
@djmaze commented on GitHub (Dec 17, 2023):
I just tried out nous-hermes:70b-llama2-q2_K in order to have a bigger model for comparison. With 51 of 81 layers offloaded to GPU, the token generation is quite slow, as expected. But I do not experience the initial delay, even when the context grows.
I also tried dolphin-mixtral:8x7b-v2.5-q4_K_M (a Mixtral finetune). It causes the same delays as I've seen with mixtral:8x7b-instruct.
From this I deduce that (at least for me) the problem is specific to the Mixtral models.
@coder543 commented on GitHub (Dec 17, 2023):
@djmaze please post the verbose output. Does it not show that the number of prompt eval tokens is growing? Presumably, it just has a much more optimized prompt eval rate, as with the mistral output I showed, but it should still have the same fundamental issue that it does not cache the eval state.
@djmaze commented on GitHub (Dec 18, 2023):
@coder543 (Sorry, I was testing with the webui before, so I didn't have any values.) After I found out how to do it, I tested the prompt eval rate of several models with ollama now (approximate values):
It seems interesting to me that although nous-hermes:70b-llama2-q2_K has a similar number of layers offloaded to the GPU and a much slower eval rate, it still shows a much higher prompt eval rate than mixtral:8x7b-instruct-v0.1-q4_K_M.
TL;DR: You seem to be right. The mixtral prompt eval rate, at least when only partially offloaded, looks abysmal. I wonder if that is because of the MoE architecture. Or does it also depend on the quantization?
@ghost commented on GitHub (Dec 19, 2023):
I installed Ollama with the curl ... | sh command on WSL, and am running dolphin-mixtral:latest on 64G RAM and a 4080 with 16G VRAM. I don't really understand anything about running this stuff, but yeah, the more I talk to the AI, the longer every reply gets delayed. Is it something that can be fixed at the software level, or something I can do on my end?
@djmaze commented on GitHub (Dec 19, 2023):
Either way, I support @coder543's wish for a prompt eval cache. There is already an issue at #1573 for that, maybe we can continue there.
@jamesbascle commented on GitHub (Dec 19, 2023):
I used
to try getting as much of it onto my 3090 as possible and got a bit of a speedup but it is still pretty slow and only gets slower as the conversation goes on.
@phalexo commented on GitHub (Dec 20, 2023):
I was wondering if there is any indication that someone is looking into this? Also, I am wondering what effect the LLAMA_CUDA_FORCE_MMQ=on setting has on performance. If the optimized cuBLAS kernels are not used, then what is the performance penalty when using MMQ kernels instead?
And why was ollama 0.1.11 and earlier working? Presumably it was using cuBLAS. What changed from 0.1.11 to 0.1.12 to make it stop working?
@gnusenpai commented on GitHub (Dec 21, 2023):
Building ollama with https://github.com/ggerganov/llama.cpp/pull/4538 and (optionally, if you do CPU+GPU inference) https://github.com/ggerganov/llama.cpp/pull/4553 has made prompt eval significantly faster for me. (~60t/s vs. ~10t/s)
@coder543 commented on GitHub (Dec 21, 2023):
For me, using llama.cpp directly, that PR appears to have raised prompt eval rate to about 325t/s:
print_timings: prompt eval time = 3444.74 ms / 1122 tokens ( 3.07 ms per token, 325.71 tokens per second)
print_timings: eval time = 5166.55 ms / 205 runs ( 25.20 ms per token, 39.68 tokens per second)
print_timings: total time = 8611.28 ms
Still not as fast as other models, but a significant improvement
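For anyone trying to reproduce that comparison: the print_timings lines above are in the format of the llama.cpp example server's log. A launch command in the spirit of that test would look roughly like the following, where the model path, offloaded layer count, and context size are assumptions rather than values from the thread:
./server -m ./mixtral-8x7b-instruct-v0.1.Q3_K_S.gguf -ngl 33 -c 4096 --port 8080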
@djmaze commented on GitHub (Dec 21, 2023):
But beware, it seems the quality might have dropped as a side effect: https://github.com/ggerganov/llama.cpp/issues/4572
@Confuze commented on GitHub (Dec 22, 2023):
I think I'm running into the same issue on v0.0.17 (installed from the ollama-cuda package on Arch Linux).
When running dolphin-mixtral with num_gpu set to 10000 just to be sure, it's practically unusable: it takes the model about a minute to start responding to a single prompt in the first place, and it generates the answer in a painfully slow manner. (I tried it without the num_gpu parameter as well, no difference.) According to nvidia-smi, ollama isn't using the GPU (RTX 2070) at all.
This appears to be a problem related only to mixtral, as running others like llama2 results in perfect performance and my GPU being fully used.
By reading through this issue I understand there's not much we can do on a user's level, right? Apologies if this comment makes no sense, I know nothing about this thing just wanted to generate the recipe for meth.
@coder543 commented on GitHub (Dec 22, 2023):
@Confuze you don’t have enough VRAM to run Mixtral entirely on the GPU. ollama will be trying to load the model onto the GPU, running out of memory, and then falling back to just running on the CPU.
@phalexo commented on GitHub (Dec 22, 2023):
I don't think we should confuse two separate problems.
Sometimes there is really not enough VRAM.
Sometimes you run into the cuBLAS 15 error, which was introduced starting with v0.1.12 and which also often looks like an OOM. v0.1.11 didn't have this issue.
The only way to mitigate it, that I am aware of at the moment, is to build with LLAMA_CUDA_FORCE_MMQ=on, but this solution, as far as I know, is slower than cuBLAS. It really should be fixed.
@Confuze commented on GitHub (Dec 22, 2023):
I see, after looking at the logs, it seems like you are right.
So, the only option I have is running this model on a cpu? (besides getting a better gpu of course) There's no way to load it partially with the gpu and partially with the cpu?
@coder543 commented on GitHub (Dec 22, 2023):
@Confuze the num_gpu parameter that you set to 10000 was trying to force more layers onto the GPU. Mixtral has 33 layers. You just have to keep lowering that number until the VRAM usage is low enough. I would be surprised if you can fit more than 10 layers on the 8GB of VRAM that I think your GPU has. (You have to call ollama create then start a new ollama run session after each change to the Modelfile, or else the changes won't apply.) Then it will use both the CPU and the GPU. Unfortunately, offloading only a small number of layers of any model doesn't seem to give much more speed than just using the CPU, but you can try it out and see how well it works for you.
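A minimal sketch of that iterate-on-num_gpu loop (the model tag, target name, and starting layer count here are illustrative, not recommendations from this comment):
# Modelfile
FROM dolphin-mixtral:8x7b-v2.5-q4_K_M
PARAMETER num_gpu 10
# then rebuild and start a fresh session after every change:
ollama create dolphin-mixtral-gpu -f ./Modelfile
ollama run dolphin-mixtral-gpu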
@madsamjp commented on GitHub (Dec 23, 2023):
@confuze, I've successfully managed to run this model using text-generation-webui with llama.cpp.
I offload 20 layers to my 4090 with a context window of 8k. I get a consistent 8-10 tps each time. This slowness issue is definitely an issue with Ollama.
@coder543 commented on GitHub (Dec 23, 2023):
@madsamjp With a 4090, you should be able to offload all 33 layers of the 3-bit quantized models and get 50+ tokens per second. If you want to run the 5-bit model, it will be slow because CPU inference of any LLM is dependent on the memory bandwidth, and outside of Apple Silicon, CPUs do not have very much memory bandwidth compared to GPUs.
I’m not connected to the ollama project, but I don’t see how this is ollama’s fault in the slightest.
Unless you’re talking about the prompt eval time issue, which was already discussed at length and is clearly a choice ollama has made not to cache the eval state between prompts. In which case, I don’t see anything new in your comment. @Confuze did not seem to be talking about the prompt eval issue at all. They were encountering slowness on the very first prompt, not subsequent prompts where the context was growing.
@madsamjp commented on GitHub (Dec 23, 2023):
@coder543 I understand that running the 5 bit model will be slow on a 4090 compared to running the 3 bit. My comment was specifically in response to this point that @confuze made: "So, the only option I have is running this model on a cpu? ". I've found that running this model using llama.cpp (with ooba), and partially offloading to gpu seems to work fine compared to Ollama, where it doesn't work without very long (and progressively worse) prompt eval times. Using Ollama, after 4 prompts, I'm waiting about 1 minute before I start to get a response. The response timing for me is not slow - about 10 tps.
My understanding of this thread was that Ollama seems to have progressively longer prompt eval times - even for models that fit entirely in VRAM. If this is because of a conscious decision that the Ollama team has made, then it makes running Mixtral using Ollama unfeasible.
It seems that perhaps we are discussing separate issues in the same thread which is leading to confusion.
@iTestAndroid commented on GitHub (Dec 29, 2023):
This is my nvidia-smi output
It's responding very slowly for me. Some prompts take 15 seconds. Any suggestions? I did the Modelfile trick with num_gpu set to 1000 but it's still doing 93 as I can see when I run ps -aux.
My server does have 768GB RAM and 2x Xeon CPUs; still, it's disappointingly slow to run mixtral.
@coder543 commented on GitHub (Jan 6, 2024):
FWIW, I saw that the release notes for ollama 0.1.18 mentioned "Improved performance when sending follow up messages in ollama run or via the API."
I just tested it, and it appears that ollama is now caching the eval state between prompts.
The first prompt:
The second prompt:
And, for good measure, sending a third prompt to the same chat:
In the second and third prompts, it still evaluated very few tokens for the prompt. In previous versions, it would evaluate the entire context window again with each message.
So, one of the two problems being discussed here appears to be resolved. The other issue (prompt eval rate being low for Mixtral) is still relatively unsolved.
@djmaze commented on GitHub (Jan 13, 2024):
As the latest Ollama versions are crashing or not even starting for me, I was looking for an alternative solution, at least until the current problems are solved. For people experiencing similar problems, I can warmly recommend using ExllamaV2 or, more concretely, TabbyAPI, which uses ExllamaV2 as its backend.
A 3.5 bpw quant of mixtral with 4k context easily fits into a 24 GB card, even leaving a few GB for other stuff. With a 3090, I am seeing consistent eval rates of 70 tps, which is much more than I was able to achieve with Ollama / llama.cpp.
It might even be interesting to add ExllamaV2 as a backend for Ollama?
@Bearsaerker commented on GitHub (Jan 22, 2024):
This problem does not only exist with the 8x7b Mixtral version. All MoEs I tested had the initial big delay, while other models were instant. I used the fusion 2x7b q4km and the solar q5km. The Solar output was instant, while the fusion 2x7b gradually increased its delay as the context grew.
@Bearsaerker commented on GitHub (Jan 23, 2024):
With the newest pre-release of ollama 0.1.21 it seems fixed. I'm sure it had something to do with llama.cpp, which was updated in this release.
@pdevine commented on GitHub (Mar 12, 2024):
This should be fixed now. With a 4090, I see:
Then:
And for the 3rd prompt:
Going to go ahead and close the issue.
@grafke commented on GitHub (Mar 13, 2024):
I just pulled the latest docker image ollama/ollama:0.1.29 and I'm still experiencing very long prompt eval times (with large prompts). @pdevine do you know if the fix for it is in the image? Or shall I build the ollama from the latest main?
Here are my results, the first prompt:
the second prompt
and the third prompt:
Generation is fast but the prompt eval time is suuuuper slow.
I'm using the option: "num_ctx": 32768
And running this model: https://ollama.com/grf/mixtral_wa_q4_cp (it's a quantized mixtral with an adapter) on an A100-40GB.
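For readers unfamiliar with where that num_ctx option goes (this is standard ollama usage, not something spelled out in the comment): it can be passed per request in the options object, e.g.
curl http://localhost:11434/api/generate -d '{"model": "grf/mixtral_wa_q4_cp", "prompt": "...", "options": {"num_ctx": 32768}}'
or set once in a Modelfile with PARAMETER num_ctx 32768.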
@pdevine commented on GitHub (Mar 13, 2024):
@grafke back-of-the-napkin math for mixtral at a 4-bit quantization is that it needs about 30ish GB, but I'm not 100% sure how the context length impacts the total amount of memory required (i.e. if you're swapping) or if it's just that the long context length requires that much more computation power.
My understanding is that the memory/computational resources scale quadratically as you increase the context size, so you're going to need quite a bit more memory than the 40GB. FWIW I pulled your model on my M3 128GB machine and got:
I think that's roughly tracking the speeds you're seeing?
@grafke commented on GitHub (Mar 14, 2024):
@pdevine Thanks for taking a look into it! I will try to get a A100 80GB to see if this could be resolved by increasing the memory. Indeed you're right, I'm seeing similar results.
I tested the (slow) transformers library and got a TTFT (time to first token) of ~1.5 seconds (prompts are between 2k and 3k tokens long), whilst on ollama the TTFT is ~8-12 seconds with the same prompts (however, the total response time is 2-3x faster on ollama).
So that got me thinking whether there is something I'm missing that makes the "slow" library have a shorter TTFT.
@pdevine commented on GitHub (Mar 14, 2024):
Interesting. Are you including model loading time in the TTFT? On my system that's about 2 seconds, although I'm not including model load time.
@pdevine commented on GitHub (Mar 14, 2024):
@grafke just thinking about that some more, you can make a call like:
Which will preload the model in memory so that when you make the next call it should be faster.
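As a hedged illustration of such a preload call (standard ollama API behaviour, not necessarily the exact command from the original comment): sending a generate request with no prompt loads the model into memory without producing a completion, e.g.
curl http://localhost:11434/api/generate -d '{"model": "grf/mixtral_wa_q4_cp"}'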
@grafke commented on GitHub (Mar 18, 2024):
@pdevine The model is indeed preloaded in memory after I make the first request (and it is slow). The requests I posted above are subsequent requests, when the model is already in memory. I guess this is due to the very long prompts (up to 2k-3k tokens). Shorter prompts indeed have a much shorter TTFT.