[GH-ISSUE #7762] What happened with the recent update? #51469

Closed
opened 2026-04-28 20:16:27 -05:00 by GiteaMirror · 21 comments

Originally created by @JTMarsh556 on GitHub (Nov 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7762

What is the issue?

I just updated this morning and applications that worked flawlessly no longer work. It is like RAG was decimated. The LLMs are just providing generic garbage answers like they always do without RAG.

What happened and how can we fix it?

Until then how can revert back?

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.4.2

GiteaMirror added the bug label 2026-04-28 20:16:27 -05:00

@rick-github commented on GitHub (Nov 20, 2024):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) would make debugging easier. You can roll back by running the OllamaSetup.exe of the version you upgraded from, e.g. 0.3.14 (https://github.com/ollama/ollama/releases/download/v0.3.14/OllamaSetup.exe).


@JTMarsh556 commented on GitHub (Nov 20, 2024):

Thank you, that is very helpful. I can only find one .exe on my system, and it is the version that I started to experience the issues with.


@rick-github commented on GitHub (Nov 20, 2024):

The link I gave is to the setup program for 0.3.14; click on it and it will download and run, overwriting the 0.4.2 version you have with a new 0.3.14 installation. Models will be preserved.
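If it helps to confirm which build is active after the reinstall, the CLI reports its version; a quick check (not specific to this rollback):

# prints the installed Ollama version, e.g. 0.3.14 after the downgrade
ollama -v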


@JTMarsh556 commented on GitHub (Nov 20, 2024):

OK, that fixed it. I am going to update again to the same release where I started to see the issue and see if possibly it was just something that went wrong during the update, but if not I can revert back again. Thank you so much, rick.


@rick-github commented on GitHub (Nov 20, 2024):

The transition from 0.3 to 0.4 was due to a change in runner architecture. Some teething problems were expected, and they've mainly fallen into two camps: performance and self-building. AFAIK, nobody has reported issues with RAG systems on the 0.4 series. If you find that it's still broken, we'd appreciate server logs so we can rectify the problem ASAP.


@jreves commented on GitHub (Nov 20, 2024):

One change I've been struggling with this morning has to do with how 0.4.2 handles inputs greater than the context window. In version 0.4.0, my server log shows:

time=2024-11-20T09:25:19.706-06:00 level=WARN source=runner.go:126 msg="truncating input prompt" limit=20408 prompt=2458413 numKeep=5

..with a num_ctx value specified as 20408. But with the 0.4.2 version, I see this:

time=2024-11-20T09:19:16.788-06:00 level=WARN source=runner.go:122 msg="input exceeds context length" prompt=2458413 limit=20408

..and the behavior is the big issue: under 0.4.2 the GPU/memory just maxes out and hangs, while under 0.4.0 the query completes with the truncated input.

No doubt I'm doing something wrong in formulating the query, but it's a change in behavior that might be manifesting different symptoms. I see something in the change log for 0.4.2 that mentions reliability improvements when hitting context limits.

I can open a separate issue if it's helpful.
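For reference, num_ctx here is a per-request option on the generate API; a minimal sketch of how it is passed (the model name and prompt are placeholders):

# "options.num_ctx" sets the context window for this request
curl -s localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the retrieved documents ...",
  "options": { "num_ctx": 20408 },
  "stream": false
}'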


@JTMarsh556 commented on GitHub (Nov 20, 2024):

So I took this and was able to test through all versions from 3.4 to 4.2. Everything is stable through 4.1. Unfortunately, 4.2 has issues with context. At first I thought it was just related to the RAG context, however it also cannot follow instructions properly. Simple things like output structure instructions are not being followed.


@rick-github commented on GitHub (Nov 20, 2024):

OK, it sounds like context issues in 0.4.2 for both of you. Please add server logs, and if possible, enable extra debugging by setting OLLAMA_DEBUG=1 in the server environment.
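For anyone following along, a minimal sketch of enabling the extra debug output; on Linux/macOS the variable can be set inline when starting the server, while on Windows it is set as a user environment variable before restarting Ollama (see the troubleshooting doc linked above):

# stop any running instance first, then start the server with debug logging enabled
OLLAMA_DEBUG=1 ollama serve
# the additional DEBUG lines (e.g. the cache.go "context limit hit - shifting" messages) show up in the server log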


@JTMarsh556 commented on GitHub (Nov 20, 2024):

> One change I've been struggling with this morning has to do with how 0.4.2 handles inputs greater than the context window. In version 0.4.0, my server log shows:
>
> time=2024-11-20T09:25:19.706-06:00 level=WARN source=runner.go:126 msg="truncating input prompt" limit=20408 prompt=2458413 numKeep=5
>
> ..with a num_ctx value specified as 20408. But with the 0.4.2 version, I see this:
>
> time=2024-11-20T09:19:16.788-06:00 level=WARN source=runner.go:122 msg="input exceeds context length" prompt=2458413 limit=20408
>
> ..and the behavior is the big issue: under 0.4.2 the GPU/memory just maxes out and hangs, while under 0.4.0 the query completes with the truncated input.
>
> No doubt I'm doing something wrong in formulating the query, but it's a change in behavior that might be manifesting different symptoms. I see something in the change log for 0.4.2 that mentions reliability improvements when hitting context limits.
>
> I can open a separate issue if it's helpful.

Interesting, I think it is related. I have some test questions that definitely exceed the context window.


@jreves commented on GitHub (Nov 20, 2024):

Here are my server logs for 0.4.0 and 0.4.2:
server.log (https://github.com/user-attachments/files/17833139/server.log)
server-1.log (https://github.com/user-attachments/files/17833140/server-1.log)

I'll see if I can get debug enabled as well.

** Give me a bit to get you debug output. This debug level dumps the entire input, and I need to trim that out ;-)


@rick-github commented on GitHub (Nov 20, 2024):

Thanks, I was able to repro with:

(echo '{"model":"llama3.2:3b-instruct-q4_K_M","prompt":' ; (echo write a story with these words ; cat /usr/share/dict/word{s,s,s,s,s,s,s,s}) | dd bs=1 count=500000 status=none | jq -sR . ; echo ',"options":{"seed":42,"temperature":0,"num_gpu":-1,"num_ctx":2048,"num_predict":256},"stream":true}')|  curl -s localhost:11434/api/generate -d @-

It looks like the change in behaviour was introduced with 65973ceb64 (https://github.com/ollama/ollama/commit/65973ceb6417c2e2796fa59bd3225bc7bd79b403, @jessegross). Rather than truncating a large prompt up front, it processes the lot, which results in a lot of context shifts (196 for the repro case):

ollama  | time=2024-11-20T17:16:10.643Z level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=2048 input=2048 keep=5 discard=1021
ollama  | time=2024-11-20T17:16:10.867Z level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=2048 input=2048 keep=5 discard=1021
ollama  | time=2024-11-20T17:16:11.099Z level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=2048 input=2048 keep=5 discard=1021
ollama  | time=2024-11-20T17:16:11.323Z level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=2048 input=2048 keep=5 discard=1021

It eventually completes and returns, but for extremely large prompts like yours and @JTMarsh556's it just takes too long, and I expect your clients are timing out.


@jreves commented on GitHub (Nov 20, 2024):

What's the best mechanism to handle this? I can see:

  • failing the request, and returning an error to the caller to trim the input (ugly)
  • gracefully truncating the input to the num_ctx value, and completing (that's the legacy behavior, but there's no warning)
  • adding a warning, truncating and completing
  • something more drastic, like killing the server...?

Here's my (trimmed) 0.4.0 debug log:
server.log (https://github.com/user-attachments/files/17834672/server.log)
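In the meantime, a client-side workaround along the lines of the second option above is to cap the prompt before it ever reaches the server; a rough sketch in the same spirit as the repro command earlier in the thread (the 500000-byte cap and file name are arbitrary stand-ins for whatever fits your num_ctx):

# trim the prompt client-side, then build the request JSON with jq and post it
head -c 500000 big_prompt.txt \
  | jq -Rs '{model:"llama3.2:3b-instruct-q4_K_M", prompt:., options:{num_ctx:2048}, stream:false}' \
  | curl -s localhost:11434/api/generate -d @-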


@rick-github commented on GitHub (Nov 20, 2024):

BTW, your logs show that OLLAMA_NUM_PARALLEL is unset, and as a result, ollama is using a default of 4. Because of this, ollama is allocating 4 buffers of 20408, for a total context buffer size of 81632. From the looks of it, this just squeezes into available GPU RAM. If you are not doing concurrent requests, you can set OLLAMA_NUM_PARALLEL=1 and approximately quadruple your context window.
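A minimal sketch of what that looks like when launching the server yourself (on Windows the variable would instead be set in the user environment before restarting Ollama):

# one request slot instead of the default 4, so the whole context buffer goes to a single request
# (4 slots x 20408 tokens = 81632; with 1 slot the same VRAM supports a much larger num_ctx)
OLLAMA_NUM_PARALLEL=1 ollama serve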


@jreves commented on GitHub (Nov 20, 2024):

I will experiment with that! I have a 16G GPU to work with.


@rick-github commented on GitHub (Nov 20, 2024):

> What's the best mechanism to handle this?

I think returning to the previous behaviour, but that's up to Jesse, as I don't know what the rationale for changing it was. Longer term it would be nice to have a warning, and there's a ticket open for that (https://github.com/ollama/ollama/issues/7043).


@jreves commented on GitHub (Nov 20, 2024):

Rick, thanks for all your insights on this! JT, I'm sorry if I hijacked this thread, but I think Rick got us to a better understanding quickly. Let's see how Jesse wants to handle this.

With 0.4.2, I was just killing the server after hours and hours of maxed-out GPU, but I'm seeing this in the tail of the verbose logging now:

time=2024-11-20T11:56:14.287-06:00 level=WARN source=runner.go:122 msg="input exceeds context length" prompt=2458413 limit=20408
time=2024-11-20T11:56:14.543-06:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=2458413 used=0 remaining=2458413
time=2024-11-20T11:58:00.882-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T11:58:54.768-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T11:59:55.007-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:00:52.154-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:01:50.381-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:02:50.476-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:03:40.530-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:04:34.076-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201
time=2024-11-20T12:05:23.436-06:00 level=DEBUG source=cache.go:217 msg="context limit hit - shifting" limit=20408 input=20408 keep=5 discard=10201

...which seems to confirm what you found attempting to reproduce.


@JTMarsh556 commented on GitHub (Nov 20, 2024):

jreves, I don't feel like you hijacked it. Your input helped a lot. I am hoping we can either revert back or create a parameter that allows us to use either method.


@JTMarsh556 commented on GitHub (Nov 20, 2024):

> BTW, your logs show that OLLAMA_NUM_PARALLEL is unset, and as a result, ollama is using a default of 4. Because of this, ollama is allocating 4 buffers of 20408, for a total context buffer size of 81632. From the looks of it, this just squeezes into available GPU RAM. If you are not doing concurrent requests, you can set OLLAMA_NUM_PARALLEL=1 and approximately quadruple your context window.

Rick, this helps a lot. Thank you for everything.


@jessegross commented on GitHub (Nov 20, 2024):

Thanks for the debugging, everyone. I just sent out a PR (#7767) that restores the previous behavior, as well as fixing a number of other issues in this area. It fixes the issue with Rick's script, but for others with more complex scenarios, if you are able to build from source and test it, that would be appreciated.

As for the original reason, here is the rationale from the commit message:

Previous versions of the runner would truncate inputs to the context
window before beginning processing. The main processing loop relied
on this behavior if the context needed to be shifted later (due to
token generation). If truncation did not occur then invariants
would be broken, causing crashes or infinite loops.

Later versions attempted to fix these bugs and make the logic less
subtle so that all inputs could be handled. Truncation was removed
to make things consistent.

However, truncation is much faster than processing and shifting, so
removing it caused performance problems when the input vastly exceeded
the context size. This restores the input truncation as a performance
optimization while keeping the more robust processing logic.
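For those wanting to test the PR before it lands in a release, a rough sketch of checking it out and building locally (assuming Git and a Go toolchain are installed; the repo's development docs are the authoritative reference for build prerequisites):

git clone https://github.com/ollama/ollama.git
cd ollama
# fetch the PR branch locally (PR #7767)
git fetch origin pull/7767/head:pr-7767
git checkout pr-7767
# build and run; at the time, go generate prepared the bundled llama.cpp sources
go generate ./...
go build .
./ollama serve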

@rick-github commented on GitHub (Nov 20, 2024):

tokens/s, 1.9MB prompt, 20KB context

version    llama3.1:8b     llama3.2:3b
0.3.10     57.41 ± 0.14    92.53 ± 0.49
0.3.11     57.43 ± 0.02    92.40 ± 0.32
0.3.12     57.39 ± 0.11    92.50 ± 0.51
0.3.13     57.39 ± 0.13    92.29 ± 0.23
0.3.14     57.77 ± 0.15    93.50 ± 0.44
0.4.0      27.24 ± 0.09    34.95 ± 0.24
0.4.1      27.29 ± 0.04    34.94 ± 0.21
c4b34f2    40.48 ± 1.23    60.25 ± 2.53

@jessegross commented on GitHub (Nov 21, 2024):

I think the reason why the new version still looks slower than the 0.3 series is that the older versions truncate even more aggressively:

  • In 0.3, if the input exceeds the context, it removes more than the difference in order to leave room for future token generation
  • 0.4 truncates the minimum possible, filling up the context window with input tokens

You should be able to see this if you look at prompt_eval_count. If I change 0.4 to truncate similarly to 0.3, then the performance is the same as (actually slightly better than) 0.3.

Keeping more of the input should improve quality in some cases, particularly for embeddings, which will never generate additional tokens.
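A quick way to compare how much of an over-length input each version actually kept is to read that field from a non-streaming response; a small sketch (the model and prompt file are placeholders):

# prompt_eval_count in the final response reports how many input tokens were actually processed
jq -Rs '{model:"llama3.2:3b", prompt:., options:{num_ctx:2048}, stream:false}' big_prompt.txt \
  | curl -s localhost:11434/api/generate -d @- | jq '.prompt_eval_count'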
