[GH-ISSUE #5303] Ollama keeps randomly re-evaluating the whole prompt, making chats impossible #49837

Open
opened 2026-04-28 13:07:37 -05:00 by GiteaMirror · 20 comments

Originally created by @drazdra on GitHub (Jun 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5303

Originally assigned to: @jessegross on GitHub.

What is the issue?

Ollama randomly starts re-evaluating the whole prompt, ignoring the cache. Normally the next message on my system starts in 1-2 seconds, but when this happens I have to wait 7-20 minutes. Further evidence: the stats for the last message show the whole prompt size under prompt eval, instead of just the last added message.

Obviously, this makes it unusable: yesterday I spent many hours just waiting for replies instead of getting them. I don't know what it depends on; at first it didn't trigger much, but later it happened nearly every second or third time once the context grew longer. Or perhaps I just didn't notice it with a small context.
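
One way to observe this from the API, as a minimal sketch: it assumes a local Ollama server on the default port, and "llama3" is a placeholder for whatever model you have pulled. On a cache hit, prompt_eval_count in the /api/chat response stays close to the size of the newly added message; when the bug triggers, it jumps to roughly the full conversation length.

```python
# Minimal sketch to watch for full re-evaluation (assumes a local Ollama at
# the default port; "llama3" is a placeholder for any pulled model).
import requests

URL = "http://localhost:11434/api/chat"
messages = []
for i in range(5):
    messages.append({"role": "user", "content": f"Message {i}: tell me more."})
    r = requests.post(URL, json={"model": "llama3", "messages": messages,
                                 "stream": False}).json()
    messages.append(r["message"])  # keep the assistant reply in the history
    # On a cache hit this stays near the size of the new message; a jump to
    # the full conversation length means the whole prompt was re-evaluated.
    print(f"turn {i}: prompt_eval_count={r.get('prompt_eval_count')}")
```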

In the logs I don't see anything related to this issue; the next reply just starts getting re-evaluated, and that's all:
200 | 48.669029927s | 127.0.0.1 | POST "/api/chat"
200 | 20.60948557s | 127.0.0.1 | POST "/api/chat"
200 | 30.495043951s | 127.0.0.1 | POST "/api/chat"
200 | 1m4s | 127.0.0.1 | POST "/api/chat"
200 | 32.433507128s | 127.0.0.1 | POST "/api/chat"
200 | 37.937415675s | 127.0.0.1 | POST "/api/chat"
200 | 8m7s | 127.0.0.1 | POST "/api/chat"
200 | 17.687657448s | 127.0.0.1 | POST "/api/chat"
200 | 17.344552043s | 127.0.0.1 | POST "/api/chat"
200 | 24.688732997s | 127.0.0.1 | POST "/api/chat"
200 | 34.470677196s | 127.0.0.1 | POST "/api/chat"
200 | 7m53s | 127.0.0.1 | POST "/api/chat"

These are from the same chat with the same context, but as you can see, some requests are far slower: those are the prompt re-evaluations.

In my opinion it's related to the concurrency changes in KV-cache processing, as similar problems started around then.

It's simply unusable on my system right now.

OS

Linux

GPU

Other

CPU

AMD

Ollama version

0.1.46

GiteaMirror added the performance and bug labels 2026-04-28 13:07:38 -05:00

@mann1x commented on GitHub (Jun 26, 2024):

@drazdra I'd check if you get something meaningful from the logs with the OLLAMA_DEBUG=1 env var

@drazdra commented on GitHub (Jun 27, 2024):

> @drazdra I'd check if you get something meaningful from the logs with the OLLAMA_DEBUG=1 env var

I did, but I didn't see anything interesting there. The kv_cache just got reset.

[update_slots] slot released | n_cache_tokens=2201 n_ctx=4000 n_past=2200 n_system_tokens=0 slot_id=0 task_id=2584 tid="135739399088000" timestamp=1719442050 truncated=true
[log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=34430 status=200 tid="135739373286976" timestamp=1719442050
200 | 2m7s | 127.0.0.1 | POST "/api/chat"
source=sched.go:348 msg="context for request finished"
source=sched.go:281 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-60af83b47d53e839830a77eb7cf8b7d474a8b4f778aca21dc73b337a304c4b54 duration=15m0s
source=sched.go:299 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-60af83b47d53e839830a77eb7cf8b7d474a8b4f778aca21dc73b337a304c4b54 refCount=0
source=sched.go:507 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-60af83b47d53e839830a77eb7cf8b7d474a8b4f778aca21dc73b337a304c4b54
[process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2692 tid="135739399088000" timestamp=1719442124
[process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2693 tid="135739399088000" timestamp=1719442124
[log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=44830 status=200 tid="135739364894272" timestamp=1719442124
source=routes.go:1367 msg="chat handler" prompt="" images=0
source=server.go:695 msg="setting token limit to 10x num_ctx" num_ctx=4000 num_predict=40000
[process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2694 tid="135739399088000" timestamp=1719442124
[launch_slot_with_data] slot is processing task | slot_id=0 task_id=2695 tid="135739399088000" timestamp=1719442124
[update_slots] slot progression | ga_i=0 n_past=24 n_past_se=0 n_prompt_tokens_processed=3848 slot_id=0 task_id=2695 tid="135739399088000" timestamp=1719442125
[update_slots] kv cache rm [p0, end) | p0=24 slot_id=0 task_id=2695 tid="135739399088000" timestamp=1719442125

@rasodu commented on GitHub (Jun 27, 2024):

I'm also encountering the same problem. When I input a prompt, it takes over a minute to receive a response, especially when there's a large conversation history involved. Notably, my system has ample resources available, with 6GB of spare GPU memory and 64GB of system memory. Could you clarify whether the caching relies on GPU or CPU memory?

Edit:
I've noticed an interesting trend: when my model exceeds the capacity of my GPU's memory, my prompt tokens are actually processed faster than when the full model fits within the GPU's VRAM.

@rasodu commented on GitHub (Jul 2, 2024):

After some investigation, I discovered the root cause of the problem: I had set OLLAMA_NUM_PARALLEL=4. By changing it to OLLAMA_NUM_PARALLEL=1, I was able to resolve the issue and get correct caching for single conversations. However, I've noticed that when I start a new conversation and then return to an old one, the entire conversation history is re-evaluated from scratch.

Edit: I also observe high prompt eval counts for some of my responses, confirming the random behavior you described.
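
To make the revisit case concrete, here is a sketch under the same assumptions as the earlier snippet (local server on the default port, "llama3" as a placeholder model): with a single slot, alternating between two conversations forces a full re-evaluation on every turn, because each request's prefix no longer matches what the slot's KV cache currently holds.

```python
# Sketch: with one slot, two alternating chats evict each other's cache
# (assumes a local Ollama at the default port; "llama3" is a placeholder).
import requests

URL = "http://localhost:11434/api/chat"
chat_a = [{"role": "user", "content": "Explain prefix caching in detail."}]
chat_b = [{"role": "user", "content": "Write a long story about winter."}]

for turn, messages in enumerate([chat_a, chat_b, chat_a, chat_b]):
    r = requests.post(URL, json={"model": "llama3", "messages": messages,
                                 "stream": False}).json()
    # Revisiting chat_a after chat_b shows a full prompt_eval_count again,
    # since the slot's cache now holds chat_b's tokens.
    print(f"turn {turn}: prompt_eval_count={r.get('prompt_eval_count')}")
```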

@rasodu commented on GitHub (Jul 9, 2024):

With Ollama 0.2.1 I see that it's trying to match the prompt against slots, but in debug mode it always ends up selecting slot 0 and re-evaluating when the prompt changes.

First query

DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",3031]

Second query

DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",43]
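
For readers following along, here is a simplified model of the slot matching being described (an illustration only, not the actual server code): each slot remembers the prompt it last served, and a new request is routed to the slot sharing the longest common prefix, which is why an idle second slot loses to slot 0 even on a tiny match.

```python
# Simplified model of prefix-based slot selection (illustration only, not
# the actual Ollama/llama.cpp code).
def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_slot(slots: list[str], prompt: str) -> tuple[int, int]:
    """Return (slot_id, matched_chars) for the longest common prefix."""
    best = max(range(len(slots)),
               key=lambda i: common_prefix_len(slots[i], prompt))
    return best, common_prefix_len(slots[best], prompt)

slots = ["<|user|>\nhi<|end|>\n<|assistant|>\n", ""]  # slot 1 sits idle
print(pick_slot(slots, "<|user|>\nGood morning<|end|>\n<|assistant|>\n"))
# -> (0, 9): only "<|user|>\n" matches, yet slot 0 beats the empty slot
```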
@rasodu commented on GitHub (Jul 9, 2024):

@jmorganca, can you please explain how caching is supposed to work with OLLAMA_NUM_PARALLEL? If I set OLLAMA_NUM_PARALLEL to 4, will Ollama recompute the response every time it encounters an old prompt, or will it retrieve the cached result instead?

@rasodu commented on GitHub (Jul 14, 2024):

Did some more debugging and found that it always ends up using the first slot. For example, the first message is "<|user|>\nhi<|end|>\n<|assistant|>\n". Now if I start a new console with a completely new session and send the prompt "<|user|>\nGood morning<|end|>\n<|assistant|>\n", it still matches "<|user|>" (which is correct, because we are matching the longest substring) and uses the first slot. It never utilizes the second slot.

I think we should match the full string instead of searching for a partial match when selecting the slot, to resolve this issue. But I don't fully understand why this approach was chosen. @jmorganca, if you tell me it's OK to match the full string, I can create a pull request with the changes.

@drazdra commented on GitHub (Jul 15, 2024):

Copy from Discord:

I wasted more time on some testing, and it seems to correlate fully with the size of the last reply; take a look. All of these are generations for the same context, that is, I just request a different message for the same chat log.

When the last generated message is large, the next one requires prompt re-evaluation. If the last message is small, the next message is created right away. In other words, when the newly generated message pushes the context past num_ctx, the kv-cache is lost. That's how it looks from the user side.

And yes, for that log num_ctx=2000.

I wasted even more time on tests, and yes, it correlates with message size. All these replies are for the same context.

The last generated message should not shift the cache, even if it exceeds num_ctx.
![4](https://github.com/user-attachments/assets/b5fd2209-d977-40d1-aaab-0a4d1f8511cd)
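
A toy illustration of this observation, with made-up token IDs and a made-up shift rule: once a long generation pushes the sequence past num_ctx, the window shifts, so the cached tokens no longer line up with the full history the client resends, and the common prefix collapses.

```python
# Toy model of a context shift invalidating the prefix cache (token IDs and
# the shift heuristic are made up for illustration).
NUM_CTX = 8

def shift(tokens, n_keep=0):
    """Drop half of the overflow from just after the kept head."""
    n_discard = (len(tokens) - n_keep) // 2
    return tokens[:n_keep] + tokens[n_keep + n_discard:]

history = [0, 1, 2, 3, 4, 5]        # cached conversation so far
generated = [100, 101, 102, 103]    # a long reply overflows num_ctx
cache = history + generated
if len(cache) > NUM_CTX:
    cache = shift(cache, n_keep=2)  # e.g. keep the system prompt

next_prompt = history + generated + [200]  # client resends the full history
match = 0
while (match < min(len(cache), len(next_prompt))
       and cache[match] == next_prompt[match]):
    match += 1
print(f"common prefix with next prompt: {match} tokens")  # -> 2
```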

@drazdra commented on GitHub (Jul 15, 2024):

A side note: I suggest adding separate columns to the log for prompt-eval time vs. inference time, and for prompt size vs. inference size.

@rasodu commented on GitHub (Jul 16, 2024):

I have created a pull request to fix this issue: #5716

@rasodu commented on GitHub (Jul 20, 2024):

@drazdra, can you please provide the prompt you use for this issue? I want to check whether my fix addresses your issue too. Also, if possible, try to verify that you get the same behavior with phi3:3.8b; I am going to test your prompt with phi3:3.8b.

@drazdra commented on GitHub (Jul 20, 2024):

> @drazdra, can you please provide the prompt you use for this issue? I want to check whether my fix addresses your issue too. Also, if possible, try to verify that you get the same behavior with phi3:3.8b; I am going to test your prompt with phi3:3.8b.

  1. My issue is totally unrelated.
  2. What you did in the code is meaningless. It was just finding the slot with the largest matching part; you, for some unexplained reason, also restricted it to be larger than 60% of the prompt. This way, partial matches below that percentage simply won't find the best slot/cache.
  3. Ollama doesn't re-evaluate the whole prompt upon changes in the last parts of the context; only the last part is re-evaluated.

@rasodu commented on GitHub (Jul 20, 2024):

@drazdra

  1. Are you setting OLLAMA_NUM_PARALLEL to anything other than 1?
  2. And are you absolutely certain that no other query is being sent to Ollama between your API calls? (For example, I use OpenWebUI, and it sends a query to figure out the name it should set for the chat after I send the first prompt.)

I understand the issue you have may not be fixed by my PR, but I just want to see whether what you are experiencing is caused by Ollama currently only using the first slot.

@rasodu commented on GitHub (Jul 20, 2024):

> Did some more debugging and found that it always ends up using the first slot. For example, the first message is "<|user|>\nhi<|end|>\n<|assistant|>\n". Now if I start a new console with a completely new session and send the prompt "<|user|>\nGood morning<|end|>\n<|assistant|>\n", it still matches "<|user|>" (which is correct, because we are matching the longest substring) and uses the first slot. It never utilizes the second slot.
>
> I think we should match the full string instead of searching for a partial match when selecting the slot, to resolve this issue. But I don't fully understand why this approach was chosen. @jmorganca, if you tell me it's OK to match the full string, I can create a pull request with the changes.

Also, here is the explanation of why I am matching 60% rather than doing a simple match. But if you are just using one slot, then this doesn't really matter much.

@drazdra commented on GitHub (Jul 21, 2024):

> Did some more debugging and found that it always ends up using the first slot. For example, the first message is "<|user|>\nhi<|end|>\n<|assistant|>\n". Now if I start a new console with a completely new session and send the prompt "<|user|>\nGood morning<|end|>\n<|assistant|>\n", it still matches "<|user|>" (which is correct, because we are matching the longest substring) and uses the first slot. It never utilizes the second slot.
>
> I think we should match the full string instead of searching for a partial match when selecting the slot, to resolve this issue. But I don't fully understand why this approach was chosen. @jmorganca, if you tell me it's OK to match the full string, I can create a pull request with the changes.
>
> Also, here is the explanation of why I am matching 60% rather than doing a simple match. But if you are just using one slot, then this doesn't really matter much.

Your explanation doesn't make any sense either.

  1. The context should match the largest partial string in the cache, so that part of the cache can be reused. This way edited messages do not trigger full prompt re-evaluation; that's just the way it's done in Ollama.
  2. Your code doesn't do what you claim: it doesn't match the full string, it requires a >60% match to use the cache, which is plain meaningless, bad, and makes things worse, as it drops useful cache. It also doesn't follow your own explanation of matching "the whole string", which in turn would be bad too, due to 1. (See the toy check after this list.)
  3. The way it is now, it may overwrite existing cache with parallel requests, but that's a known thing and I spoke about it months ago on Discord. The solution for that should be different, namely named sessions: storing the old cache under a certain identifier with a timeout timer and reusing it upon the key sent via the API.
  4. All of this is totally unrelated to my issue, so it should not be discussed here.
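
A toy check of the thresholding objected to in point 2 (the numbers are made up): requiring the match to exceed 60% of the new prompt throws away a prefix that is still a large absolute cache hit.

```python
# Toy version of a ">60% of the prompt" acceptance rule (illustration only).
def accept(match_tokens: int, prompt_tokens: int, threshold: float = 0.6) -> bool:
    return match_tokens > threshold * prompt_tokens

print(accept(2000, 4000))  # False: 2000 cached tokens get re-evaluated anyway
print(accept(2000, 3000))  # True: the same prefix passes for a shorter prompt
```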

@jessegross commented on GitHub (Oct 16, 2024):

I think there are at least two separate issues here:

  • Yes, when a message history exceeds the context length, the history gets truncated and shifted. This breaks the prompt cache and causes re-evaluation. It may be possible to fix this, but I haven't had a chance to think it through.
  • There are definitely issues with multiple conversations overwriting each other's caches. There is an improved implementation that you can try out if you build from source; you need to follow these [instructions](https://github.com/ollama/ollama/blob/main/docs/development.md#transition-to-go-runner). However, it is not enabled by default, as it causes a slowdown in single-conversation processing. You can set the environment variable OLLAMA_MULTIUSER_CACHE=1 to try it (a minimal launch sketch follows below), but note that this is purely for testing at this point and is subject to change. Hopefully we can get the best of both worlds in the future.
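
For anyone who wants to try that flag, a minimal launch sketch (assuming a source build that includes the new runner, per the comment above):

```python
# Launch the server with the experimental multi-user cache enabled
# (requires a source build with the new runner, per the comment above).
import os
import subprocess

env = dict(os.environ, OLLAMA_MULTIUSER_CACHE="1")
subprocess.run(["ollama", "serve"], env=env)
```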

@chrisoutwright commented on GitHub (Nov 4, 2024):

I also observed this, and have set these in my environment:
MaxLoadedModels = "1"
NumParallel = "1"

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 72.71 B
llm_load_print_meta: model size       = 44.15 GiB (5.22 BPW)
llm_load_print_meta: general.name     = Qwen2.5 72B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2024/11/04 - 21:54:31 | 200 |   15.2285805s |             ::1 | POST     "/api/chat"
[GIN] 2024/11/04 - 21:55:22 | 200 |   34.3895991s |             ::1 | POST     "/api/chat"
time=2024-11-04T21:55:22.936+01:00 level=WARN source=runner.go:122 msg="truncating input prompt" limit=2400 prompt=6125 numKeep=4
[GIN] 2024/11/04 - 21:55:42 | 200 |   28.6406763s |             ::1 | POST     "/api/chat"
[GIN] 2024/11/04 - 21:57:12 | 200 |         1m24s |             ::1 | POST     "/api/chat"
[GIN] 2024/11/04 - 21:58:42 | 200 |         1m19s |             ::1 | POST     "/api/chat"

After trimming (it depends), the second follow-up noticeably slows down to about half speed before token generation starts. Is this normal?

What is this numKeep=4 about?

Will this act like a sliding text window that needs to re-evaluate each time (since the beginning starts within a different window each time)? I would rather have the option to keep the last fitting window fixed and discard it once exceeded, if each follow-up is going to be this slow.

@drazdra commented on GitHub (Nov 6, 2024):

> After trimming (it depends), the second follow-up noticeably slows down to about half speed before token generation starts. Is this normal?

There are two stages: prompt evaluation and inference. The first converts the prompt into vector space and stores it in a cache. The second uses that as a starting point to generate the completion; inference is what produces the actual replies.

On the next prompt, the shared part (the history) is taken from the cache, and only the new part is converted to vector space and concatenated onto the cache. That's why it's fast.

The cache has a size limit matching the context window size (Ollama multiplies it by the number of concurrent workers), which is num_ctx. When that limit is hit, the start of the prompt (the history) is thrown away.

Cache matching is implemented by matching from the start, so when the start of the history changes (thrown away because it no longer fits in the cache), the cache can't be matched and the whole prompt is re-converted into vectors again, which is very slow.

From then on, most subsequent prompts are going to be slow, as each one shifts the start of the history again to make room for the new messages you and the AI add.

To fix this, the cache matching mechanism would need to change.
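
A schematic of the two stages and the prefix matching described above, with toy token IDs (this illustrates the behavior, not the real runner code):

```python
# Toy prefix cache: evaluation reuses the longest matching prefix and only
# (re)computes the tail; changing the *start* of the prompt loses everything.
cache: list[int] = []

def evaluate(prompt: list[int]) -> int:
    """Return how many prompt tokens actually had to be evaluated."""
    global cache
    n = 0
    while n < min(len(cache), len(prompt)) and cache[n] == prompt[n]:
        n += 1
    cache = prompt[:]  # tokens from position n onward are (re)computed
    return len(prompt) - n

print(evaluate([1, 2, 3]))           # 3: cold cache, everything evaluated
print(evaluate([1, 2, 3, 4, 5]))     # 2: only the new tail is evaluated
print(evaluate([9, 1, 2, 3, 4, 5]))  # 6: the start changed, full re-eval
```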

> What is this numKeep=4 about?

It's the number of tokens after your system prompt that are always kept when the start of the history is thrown away.

@kripper commented on GitHub (Oct 2, 2025):

Same problem here when using OpenHands with Ollama. Somehow the cache is not being used at all. The initial prompt is over 15,000 tokens; the second multi-turn message can be only 20 tokens, yet it takes longer. This makes Ollama unusable :-(

When trying Open WebUI, the cache worked fine, but for one model I noticed that the second message was not using the cache while the next ones were. So this is not exactly random, but there is definitely a caching issue present in the latest Ollama version.

I guess other users are not reporting this issue because it requires a deeper understanding of how everything works, so they just assume "Ollama is slow, vLLM et al. are faster".

Please fix.

@kripper commented on GitHub (Oct 4, 2025):

I'm analyzing my slow performance issue here: https://github.com/ollama/ollama/issues/12477

Reference: github-starred/ollama#49837