[GH-ISSUE #7762] What happened with the recent update? #51469

Closed
opened 2026-04-28 20:16:27 -05:00 by GiteaMirror · 21 comments

Originally created by @JTMarsh556 on GitHub (Nov 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7762

What is the issue?

I just updated this morning and applications that worked flawlessly no longer work. It is like RAG was decimated. The LLMs are just providing generic garbage answers like they always do without RAG.

What happened and how can we fix it?

Until then how can revert back?

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.4.2

GiteaMirror added the bug label 2026-04-28 20:16:27 -05:00

@rick-github commented on GitHub (Nov 20, 2024):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) would make debugging easier. You can roll back by running the OllamaSetup.exe of the version you upgraded from, e.g. 0.3.14 (https://github.com/ollama/ollama/releases/download/v0.3.14/OllamaSetup.exe).


@JTMarsh556 commented on GitHub (Nov 20, 2024):

Thank you, that is very helpful. I can only find one .exe on my system, and it is the version that I started to experience the issues with.


@rick-github commented on GitHub (Nov 20, 2024):

The link I gave is to the setup program for 0.3.14; click on it and it will download and run, overwriting the 0.4.2 version you have with a new 0.3.14 installation. Models will be preserved.
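If it helps to confirm which build is active after the reinstall, the CLI reports its version; a quick check (not specific to this rollback):

# prints the installed Ollama version, e.g. 0.3.14 after the downgrade
ollama -v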


@JTMarsh556 commented on GitHub (Nov 20, 2024):

OK, that fixed it. I am going to update again to the same release where I started to see the issue and see if possibly it was just something that went wrong during the update, but if not I can revert back again. Thank you so much, rick.


@rick-github commented on GitHub (Nov 20, 2024):

The transition from 0.3 to 0.4 was due to a change in runner architecture. Some teething problems were expected, and they've mainly fallen into two camps: performance and self-building. AFAIK, nobody has reported issues with RAG systems on the 0.4 series. If you find that it's still broken, we'd appreciate server logs so we can rectify the problem ASAP.


@jreves commented on GitHub (Nov 20, 2024):

One change I've been struggling with this morning has to do with how 0.4.2 handles inputs greater than the context window. In version 0.4.0, my server log shows:

time=2024-11-20T09:25:19.706-06:00 level=WARN source=runner.go:126 msg="truncating input prompt" limit=20408 prompt=2458413 numKeep=5

..with a num_ctx value specified as 20408. But with the 0.4.2 version, I see this:

time=2024-11-20T09:19:16.788-06:00 level=WARN source=runner.go:122 msg="input exceeds context length" prompt=2458413 limit=20408

..and the behavior is the big issue: under 0.4.2 the GPU/memory just maxes out and hangs, while under 0.4.0 the query completes with the truncated input.

No doubt I'm doing something wrong in formulating the query, but it's a change in behavior that might be manifesting different symptoms. I see something in the change log for 0.4.2 that mentions reliability improvements when hitting context limits.

I can open a separate issue if it's helpful.
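For reference, num_ctx here is a per-request option on the generate API; a minimal sketch of how it is passed (the model name and prompt are placeholders):

# "options.num_ctx" sets the context window for this request
curl -s localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the retrieved documents ...",
  "options": { "num_ctx": 20408 },
  "stream": false
}'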


@JTMarsh556 commented on GitHub (Nov 20, 2024):

So I took this and was able to test through all versions from 3.4 to 4.2. Everything is stable through 4.1. Unfortunately, 4.2 has issues with context. At first I thought it was just related to the RAG context, however it also cannot follow instructions properly. Simple things like output structure instructions are not being followed.


@rick-github commented on GitHub (Nov 20, 2024):

OK, it sounds like context issues in 0.4.2 for both of you. Please add server logs, and if possible, enable extra debugging by setting OLLAMA_DEBUG=1 in the server environment.
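For anyone following along, a minimal sketch of enabling the extra debug output; on Linux/macOS the variable can be set inline when starting the server, while on Windows it is set as a user environment variable before restarting Ollama (see the troubleshooting doc linked above):

# stop any running instance first, then start the server with debug logging enabled
OLLAMA_DEBUG=1 ollama serve
# the additional DEBUG lines (e.g. the cache.go "context limit hit - shifting" messages) show up in the server log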


@JTMarsh556 commented on GitHub (Nov 20, 2024):

> One change I've been struggling with this morning has to do with how 0.4.2 handles inputs greater than the context window. In version 0.4.0, my server log shows:
>
> time=2024-11-20T09:25:19.706-06:00 level=WARN source=runner.go:126 msg="truncating input prompt" limit=20408 prompt=2458413 numKeep=5
>
> ..with a num_ctx value specified as 20408. But with the 0.4.2 version, I see this:
>
> time=2024-11-20T09:19:16.788-06:00 level=WARN source=runner.go:122 msg="input exceeds context length" prompt=2458413 limit=20408
>
> ..and the behavior is the big issue: under 0.4.2 the GPU/memory just maxes out and hangs, while under 0.4.0 the query completes with the truncated input.
>
> No doubt I'm doing something wrong in formulating the query, but it's a change in behavior that might be manifesting different symptoms. I see something in the change log for 0.4.2 that mentions reliability improvements when hitting context limits.
>
> I can open a separate issue if it's helpful.

Interesting, I think it is related. I have some test questions that definitely exceed the context window.


@jreves commented on GitHub (Nov 20, 2024):

Here are my server logs for 0.4.0 and 0.4.2:
server.log (https://github.com/user-attachments/files/17833139/server.log)
server-1.log (https://github.com/user-attachments/files/17833140/server-1.log)

I'll see if I can get debug enabled as well.

** Give me a bit to get you debug output. This debug level dumps the entire input, and I need to trim that out ;-)


@rick-github commented on GitHub (Nov 20, 2024):

Thanks, I was able to repro with:

(echo '{"model":"llama3.2:3b-instruct-q4_K_M","prompt":' ; (echo write a story with these words ; cat /usr/share/dict/word{s,s,s,s,s,s,s,s}) | dd bs=1 count=500000 status=none | jq -sR . ; echo ',"options":{"seed":42,"temperature":0,"num_gpu":-1,"num_ctx":2048,"num_predict":256},"stream":true}')|  curl -s localhost:11434/api/generate -d @-

It looks like the change in behaviour was introduced with 65973ceb64 (https://github.com/ollama/ollama/commit/65973ceb6417c2e2796fa59bd3225bc7bd79b403, @jessegross). Rather than truncating a large prompt up front, it processes the lot, which results in a lot of context shifts (196 for the repro case):

ollama  | time=2024-11-20T17:16:10.643Z level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=2048 input=2048 keep=5 discard=1021
ollama  | time=2024-11-20T17:16:10.867Z level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=2048 input=2048 keep=5 discard=1021
ollama  | time=2024-11-20T17:16:11.099Z level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=2048 input=2048 keep=5 discard=1021
ollama  | time=2024-11-20T17:16:11.323Z level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=2048 input=2048 keep=5 discard=1021

It eventually completes and returns, but for extremely large prompts like yours and @JTMarsh556's it just takes too long, and I expect your clients are timing out.


@jreves commented on GitHub (Nov 20, 2024):

What's the best mechanism to handle this? I can see:

  • failing the request, and returning an error to the caller to trim the input (ugly)
  • gracefully truncating the input to the num_ctx value, and completing (that's the legacy behavior, but there's no warning)
  • adding a warning, truncating and completing
  • something more drastic, like killing the server...?

Here's my (trimmed) 0.4.0 debug log:
server.log (https://github.com/user-attachments/files/17834672/server.log)
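In the meantime, a client-side workaround along the lines of the second option above is to cap the prompt before it ever reaches the server; a rough sketch in the same spirit as the repro command earlier in the thread (the 500000-byte cap and file name are arbitrary stand-ins for whatever fits your num_ctx):

# trim the prompt client-side, then build the request JSON with jq and post it
head -c 500000 big_prompt.txt \
  | jq -Rs '{model:"llama3.2:3b-instruct-q4_K_M", prompt:., options:{num_ctx:2048}, stream:false}' \
  | curl -s localhost:11434/api/generate -d @-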


@rick-github commented on GitHub (Nov 20, 2024):

BTW, your logs show that OLLAMA_NUM_PARALLEL is unset, and as a result, ollama is using a default of 4. Because of this, ollama is allocating 4 buffers of 20408, for a total context buffer size of 81632. From the looks of it, this just squeezes into available GPU RAM. If you are not doing concurrent requests, you can set OLLAMA_NUM_PARALLEL=1 and approximately quadruple your context window.
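A minimal sketch of what that looks like when launching the server yourself (on Windows the variable would instead be set in the user environment before restarting Ollama):

# one request slot instead of the default 4, so the whole context buffer goes to a single request
# (4 slots x 20408 tokens = 81632; with 1 slot the same VRAM supports a much larger num_ctx)
OLLAMA_NUM_PARALLEL=1 ollama serve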


@jreves commented on GitHub (Nov 20, 2024):

I will experiment with that! I have a 16G GPU to work with.


@rick-github commented on GitHub (Nov 20, 2024):

> What's the best mechanism to handle this?

I think returning to the previous behaviour, but that's up to Jesse, as I don't know what the rationale for changing it was. Longer term it would be nice to have a warning, and there's a ticket open for that (https://github.com/ollama/ollama/issues/7043).


@jreves commented on GitHub (Nov 20, 2024):

Rick, thanks for all your insights on this! JT, I'm sorry if I hijacked this thread, but I think Rick got us to a better understanding quickly. Let's see how Jesse wants to handle this.

With 0.4.2, I was just killing the server after hours and hours of maxed-out GPU, but I'm seeing this in the tail of the verbose logging now:

time=2024-11-20T11:56:14.287-06:00 level=WARN source=runner.go:122 msg="input exceeds context length" prompt=2458413 limit=20408
time=2024-11-20T11:56:14.543-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=2458413 used=0 remaining=2458413
time=2024-11-20T11:58:00.882-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T11:58:54.768-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T11:59:55.007-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:00:52.154-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:01:50.381-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:02:50.476-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:03:40.530-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:04:34.076-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:05:23.436-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201

...which seems to confirm what you found attempting to reproduce.


@JTMarsh556 commented on GitHub (Nov 20, 2024):

jreves, I don't feel like you hijacked it. Your input helped a lot. I am hoping we can either revert back or create a parameter that allows us to use either method.


@JTMarsh556 commented on GitHub (Nov 20, 2024):

> BTW, your logs show that OLLAMA_NUM_PARALLEL is unset, and as a result, ollama is using a default of 4. Because of this, ollama is allocating 4 buffers of 20408, for a total context buffer size of 81632. From the looks of it, this just squeezes into available GPU RAM. If you are not doing concurrent requests, you can set OLLAMA_NUM_PARALLEL=1 and approximately quadruple your context window.

Rick, this helps a lot. Thank you for everything.


@jessegross commented on GitHub (Nov 20, 2024):

Thanks for the debugging, everyone. I just sent out a PR (#7767) that restores the previous behavior, as well as fixing a number of other issues in this area. It fixes the issue with Rick's script, but for others with more complex scenarios, if you are able to build from source and test it, that would be appreciated.

As for the original reason, here is the rationale from the commit message:

Previous versions of the runner would truncate inputs to the context
window before beginning processing. The main processing loop relied
on this behavior if the context needed to be shifted later (due to
token generation). If truncation did not occur then invariants
would be broken, causing crashes or infinite loops.

Later versions attempted to fix these bugs and make the logic less
subtle so that all inputs could be handled. Truncation was removed
to make things consistent.

However, truncation is much faster than processing and shifting, so
removing it caused performance problems when the input vastly exceeded
the context size. This restores the input truncation as a performance
optimization while keeping the more robust processing logic.
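For those wanting to test the PR before it lands in a release, a rough sketch of checking it out and building locally (assuming Git and a Go toolchain are installed; the repo's development docs are the authoritative reference for build prerequisites):

git clone https://github.com/ollama/ollama.git
cd ollama
# fetch the PR branch locally (PR #7767)
git fetch origin pull/7767/head:pr-7767
git checkout pr-7767
# build and run; at the time, go generate prepared the bundled llama.cpp sources
go generate ./...
go build .
./ollama serve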

@rick-github commented on GitHub (Nov 20, 2024):

tokens/s, 1.9MB prompt, 20KB context

version    llama3.1:8b     llama3.2:3b
0.3.10     57.41 ± 0.14    92.53 ± 0.49
0.3.11     57.43 ± 0.02    92.40 ± 0.32
0.3.12     57.39 ± 0.11    92.50 ± 0.51
0.3.13     57.39 ± 0.13    92.29 ± 0.23
0.3.14     57.77 ± 0.15    93.50 ± 0.44
0.4.0      27.24 ± 0.09    34.95 ± 0.24
0.4.1      27.29 ± 0.04    34.94 ± 0.21
c4b34f2    40.48 ± 1.23    60.25 ± 2.53

@jessegross commented on GitHub (Nov 21, 2024):

I think the reason why the new version still looks slower than the 0.3 series is that the older versions truncate even more aggressively:

  • In 0.3, if the input exceeds the context, it removes more than the difference in order to leave room for future token generation
  • 0.4 truncates the minimum possible, filling up the context window with input tokens

You should be able to see this if you look at prompt_eval_count. If I change 0.4 to truncate similarly to 0.3, then the performance is the same as (actually slightly better than) 0.3.

Keeping more of the input should improve quality in some cases, particularly for embeddings, which will never generate additional tokens.
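A quick way to compare how much of an over-length input each version actually kept is to read that field from a non-streaming response; a small sketch (the model and prompt file are placeholders):

# prompt_eval_count in the final response reports how many input tokens were actually processed
jq -Rs '{model:"llama3.2:3b", prompt:., options:{num_ctx:2048}, stream:false}' big_prompt.txt \
  | curl -s localhost:11434/api/generate -d @- | jq '.prompt_eval_count'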
