[GH-ISSUE #5303] Ollama keeps randomly re-evaluating the whole prompt, making chats impossible #49837

Open
opened 2026-04-28 13:07:37 -05:00 by GiteaMirror · 20 comments

Originally created by @drazdra on GitHub (Jun 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5303

Originally assigned to: @jessegross on GitHub.

What is the issue?

Ollama randomly starts re-evaluating the whole prompt, ignoring the cache. Normally the next message on my system starts in 1-2 seconds, but when this happens I have to wait 7-20 minutes. Further evidence: the stats for the last message show the whole prompt size under prompt eval, instead of just the last added message.

Obviously, this makes it unusable: yesterday I spent many hours just waiting for replies instead of getting them. I don't know what it depends on; at first it didn't trigger much, but later it happened nearly every second or third time once the context grew longer. Or perhaps I just didn't notice it with a small context.
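
One way to observe this from the API, as a minimal sketch: it assumes a local Ollama server on the default port, and "llama3" is a placeholder for whatever model you have pulled. On a cache hit, prompt_eval_count in the /api/chat response stays close to the size of the newly added message; when the bug triggers, it jumps to roughly the full conversation length.

```python
# Minimal sketch to watch for full re-evaluation (assumes a local Ollama at
# the default port; "llama3" is a placeholder for any pulled model).
import requests

URL = "http://localhost:11434/api/chat"
messages = []
for i in range(5):
    messages.append({"role": "user", "content": f"Message {i}: tell me more."})
    r = requests.post(URL, json={"model": "llama3", "messages": messages,
                                 "stream": False}).json()
    messages.append(r["message"])  # keep the assistant reply in the history
    # On a cache hit this stays near the size of the new message; a jump to
    # the full conversation length means the whole prompt was re-evaluated.
    print(f"turn {i}: prompt_eval_count={r.get('prompt_eval_count')}")
```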

In the logs I don't see anything related to this issue; the next reply just starts getting re-evaluated, and that's all:
200 | 48.669029927s | 127.0.0.1 | POST "/api/chat"
200 | 20.60948557s | 127.0.0.1 | POST "/api/chat"
200 | 30.495043951s | 127.0.0.1 | POST "/api/chat"
200 | 1m4s | 127.0.0.1 | POST "/api/chat"
200 | 32.433507128s | 127.0.0.1 | POST "/api/chat"
200 | 37.937415675s | 127.0.0.1 | POST "/api/chat"
200 | 8m7s | 127.0.0.1 | POST "/api/chat"
200 | 17.687657448s | 127.0.0.1 | POST "/api/chat"
200 | 17.344552043s | 127.0.0.1 | POST "/api/chat"
200 | 24.688732997s | 127.0.0.1 | POST "/api/chat"
200 | 34.470677196s | 127.0.0.1 | POST "/api/chat"
200 | 7m53s | 127.0.0.1 | POST "/api/chat"

These are from the same chat with the same context, but as you can see, some requests are far slower: those are the prompt re-evaluations.

In my opinion it's related to the concurrency changes in KV-cache processing, as similar problems started around then.

It's simply unusable on my system right now.

OS

Linux

GPU

Other

CPU

AMD

Ollama version

0.1.46

GiteaMirror added the performance and bug labels 2026-04-28 13:07:38 -05:00

@mann1x commented on GitHub (Jun 26, 2024):

@drazdra I'd check if you get something meaningful from the logs with the OLLAMA_DEBUG=1 env var

@drazdra commented on GitHub (Jun 27, 2024):

> @drazdra I'd check if you get something meaningful from the logs with the OLLAMA_DEBUG=1 env var

I did, but I didn't see anything interesting there. The kv_cache just got reset.

[update_slots] slot released | n_cache_tokens=2201 n_ctx=4000 n_past=2200 n_system_tokens=0 slot_id=0 task_id=2584 tid="135739399088000" timestamp=1719442050 truncated=true
[log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=34430 status=200 tid="135739373286976" timestamp=1719442050
200 | 2m7s | 127.0.0.1 | POST "/api/chat"
source=sched.go:348 msg="context for request finished"
source=sched.go:281 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-60af83b47d53e839830a77eb7cf8b7d474a8b4f778aca21dc73b337a304c4b54 duration=15m0s
source=sched.go:299 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-60af83b47d53e839830a77eb7cf8b7d474a8b4f778aca21dc73b337a304c4b54 refCount=0
source=sched.go:507 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-60af83b47d53e839830a77eb7cf8b7d474a8b4f778aca21dc73b337a304c4b54
[process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2692 tid="135739399088000" timestamp=1719442124
[process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2693 tid="135739399088000" timestamp=1719442124
[log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=44830 status=200 tid="135739364894272" timestamp=1719442124
source=routes.go:1367 msg="chat handler" prompt="" images=0
source=server.go:695 msg="setting token limit to 10x num_ctx" num_ctx=4000 num_predict=40000
[process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2694 tid="135739399088000" timestamp=1719442124
[launch_slot_with_data] slot is processing task | slot_id=0 task_id=2695 tid="135739399088000" timestamp=1719442124
[update_slots] slot progression | ga_i=0 n_past=24 n_past_se=0 n_prompt_tokens_processed=3848 slot_id=0 task_id=2695 tid="135739399088000" timestamp=1719442125
[update_slots] kv cache rm [p0, end) | p0=24 slot_id=0 task_id=2695 tid="135739399088000" timestamp=1719442125

@rasodu commented on GitHub (Jun 27, 2024):

I'm also encountering the same problem. When I input a prompt, it takes over a minute to receive a response, especially when there's a large conversation history involved. Notably, my system has ample resources available, with 6GB of spare GPU memory and 64GB of system memory. Could you clarify whether the caching relies on GPU or CPU memory?

Edit:
I've noticed an interesting trend: when my model exceeds the capacity of my GPU's memory, my prompt tokens are actually processed faster than when the full model fits within the GPU's VRAM.

@rasodu commented on GitHub (Jul 2, 2024):

After some investigation, I discovered the root cause of the problem: I had set OLLAMA_NUM_PARALLEL=4. By changing it to OLLAMA_NUM_PARALLEL=1, I was able to resolve the issue and get correct caching for single conversations. However, I've noticed that when I start a new conversation and then return to an old one, the entire conversation history is re-evaluated from scratch.

Edit: I also observe high prompt eval counts for some of my responses, confirming the random behavior you described.
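
To make the revisit case concrete, here is a sketch under the same assumptions as the earlier snippet (local server on the default port, "llama3" as a placeholder model): with a single slot, alternating between two conversations forces a full re-evaluation on every turn, because each request's prefix no longer matches what the slot's KV cache currently holds.

```python
# Sketch: with one slot, two alternating chats evict each other's cache
# (assumes a local Ollama at the default port; "llama3" is a placeholder).
import requests

URL = "http://localhost:11434/api/chat"
chat_a = [{"role": "user", "content": "Explain prefix caching in detail."}]
chat_b = [{"role": "user", "content": "Write a long story about winter."}]

for turn, messages in enumerate([chat_a, chat_b, chat_a, chat_b]):
    r = requests.post(URL, json={"model": "llama3", "messages": messages,
                                 "stream": False}).json()
    # Revisiting chat_a after chat_b shows a full prompt_eval_count again,
    # since the slot's cache now holds chat_b's tokens.
    print(f"turn {turn}: prompt_eval_count={r.get('prompt_eval_count')}")
```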

@rasodu commented on GitHub (Jul 9, 2024):

With Ollama 0.2.1 I see that it's trying to match the prompt against slots, but in debug mode it always ends up selecting slot 0 and re-evaluating when the prompt changes.

First query

DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",3031]

Second query

DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",43]
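
For readers following along, here is a simplified model of the slot matching being described (an illustration only, not the actual server code): each slot remembers the prompt it last served, and a new request is routed to the slot sharing the longest common prefix, which is why an idle second slot loses to slot 0 even on a tiny match.

```python
# Simplified model of prefix-based slot selection (illustration only, not
# the actual Ollama/llama.cpp code).
def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_slot(slots: list[str], prompt: str) -> tuple[int, int]:
    """Return (slot_id, matched_chars) for the longest common prefix."""
    best = max(range(len(slots)),
               key=lambda i: common_prefix_len(slots[i], prompt))
    return best, common_prefix_len(slots[best], prompt)

slots = ["<|user|>\nhi<|end|>\n<|assistant|>\n", ""]  # slot 1 sits idle
print(pick_slot(slots, "<|user|>\nGood morning<|end|>\n<|assistant|>\n"))
# -> (0, 9): only "<|user|>\n" matches, yet slot 0 beats the empty slot
```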
@rasodu commented on GitHub (Jul 9, 2024):

@jmorganca, can you please explain how caching is supposed to work with OLLAMA_NUM_PARALLEL? If I set OLLAMA_NUM_PARALLEL to 4, will Ollama recompute the response every time it encounters an old prompt, or will it retrieve the cached result instead?

@rasodu commented on GitHub (Jul 14, 2024):

Did some more debugging and found that it always ends up using the first slot. For example, the first message is "<|user|>\nhi<|end|>\n<|assistant|>\n". Now if I start a new console with a completely new session and send the prompt "<|user|>\nGood morning<|end|>\n<|assistant|>\n", it still matches "<|user|>" (which is correct, because we are matching the longest substring) and uses the first slot. It never utilizes the second slot.

I think we should match the full string instead of searching for a partial match when selecting the slot, to resolve this issue. But I don't fully understand why this approach was chosen. @jmorganca, if you tell me it's OK to match the full string, I can create a pull request with the changes.

@drazdra commented on GitHub (Jul 15, 2024):

Copy from Discord:

I wasted more time on some testing, and it seems to correlate fully with the size of the last reply; take a look. All of these are generations for the same context, that is, I just request a different message for the same chat log.

When the last generated message is large, the next one requires prompt re-evaluation. If the last message is small, the next message is created right away. In other words, when the newly generated message pushes the context past num_ctx, the kv-cache is lost. That's how it looks from the user side.

And yes, for that log num_ctx=2000.

I wasted even more time on tests, and yes, it correlates with message size. All these replies are for the same context.

The last generated message should not shift the cache, even if it exceeds num_ctx.
![4](https://github.com/user-attachments/assets/b5fd2209-d977-40d1-aaab-0a4d1f8511cd)
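
A toy illustration of this observation, with made-up token IDs and a made-up shift rule: once a long generation pushes the sequence past num_ctx, the window shifts, so the cached tokens no longer line up with the full history the client resends, and the common prefix collapses.

```python
# Toy model of a context shift invalidating the prefix cache (token IDs and
# the shift heuristic are made up for illustration).
NUM_CTX = 8

def shift(tokens, n_keep=0):
    """Drop half of the overflow from just after the kept head."""
    n_discard = (len(tokens) - n_keep) // 2
    return tokens[:n_keep] + tokens[n_keep + n_discard:]

history = [0, 1, 2, 3, 4, 5]        # cached conversation so far
generated = [100, 101, 102, 103]    # a long reply overflows num_ctx
cache = history + generated
if len(cache) > NUM_CTX:
    cache = shift(cache, n_keep=2)  # e.g. keep the system prompt

next_prompt = history + generated + [200]  # client resends the full history
match = 0
while (match < min(len(cache), len(next_prompt))
       and cache[match] == next_prompt[match]):
    match += 1
print(f"common prefix with next prompt: {match} tokens")  # -> 2
```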

@drazdra commented on GitHub (Jul 15, 2024):

A side note: I suggest adding separate columns to the log for prompt-eval time vs. inference time, and for prompt size vs. inference size.

@rasodu commented on GitHub (Jul 16, 2024):

I have created a pull request to fix this issue: #5716

@rasodu commented on GitHub (Jul 20, 2024):

@drazdra, can you please provide the prompt you use for this issue? I want to check whether my fix addresses your issue too. Also, if possible, try to verify that you get the same behavior with phi3:3.8b; I am going to test your prompt with phi3:3.8b.

@drazdra commented on GitHub (Jul 20, 2024):

> @drazdra, can you please provide the prompt you use for this issue? I want to check whether my fix addresses your issue too. Also, if possible, try to verify that you get the same behavior with phi3:3.8b; I am going to test your prompt with phi3:3.8b.

  1. My issue is totally unrelated.
  2. What you did in the code is meaningless. It was just finding the slot with the largest matching part; you, for some unexplained reason, also restricted it to be larger than 60% of the prompt. This way, partial matches below that percentage simply won't find the best slot/cache.
  3. Ollama doesn't re-evaluate the whole prompt upon changes in the last parts of the context; only the last part is re-evaluated.

@rasodu commented on GitHub (Jul 20, 2024):

@drazdra

  1. Are you setting OLLAMA_NUM_PARALLEL to anything other than 1?
  2. And are you absolutely certain that no other query is being sent to Ollama between your API calls? (For example, I use OpenWebUI, and it sends a query to figure out the name it should set for the chat after I send the first prompt.)

I understand the issue you have may not be fixed by my PR, but I just want to see whether what you are experiencing is caused by Ollama currently only using the first slot.

@rasodu commented on GitHub (Jul 20, 2024):

> Did some more debugging and found that it always ends up using the first slot. For example, the first message is "<|user|>\nhi<|end|>\n<|assistant|>\n". Now if I start a new console with a completely new session and send the prompt "<|user|>\nGood morning<|end|>\n<|assistant|>\n", it still matches "<|user|>" (which is correct, because we are matching the longest substring) and uses the first slot. It never utilizes the second slot.
>
> I think we should match the full string instead of searching for a partial match when selecting the slot, to resolve this issue. But I don't fully understand why this approach was chosen. @jmorganca, if you tell me it's OK to match the full string, I can create a pull request with the changes.

Also, here is the explanation of why I am matching 60% rather than doing a simple match. But if you are just using one slot, then this doesn't really matter much.

@drazdra commented on GitHub (Jul 21, 2024):

> Did some more debugging and found that it always ends up using the first slot. For example, the first message is "<|user|>\nhi<|end|>\n<|assistant|>\n". Now if I start a new console with a completely new session and send the prompt "<|user|>\nGood morning<|end|>\n<|assistant|>\n", it still matches "<|user|>" (which is correct, because we are matching the longest substring) and uses the first slot. It never utilizes the second slot.
>
> I think we should match the full string instead of searching for a partial match when selecting the slot, to resolve this issue. But I don't fully understand why this approach was chosen. @jmorganca, if you tell me it's OK to match the full string, I can create a pull request with the changes.
>
> Also, here is the explanation of why I am matching 60% rather than doing a simple match. But if you are just using one slot, then this doesn't really matter much.

Your explanation doesn't make any sense either.

  1. The context should match the largest partial string in the cache, so that part of the cache can be reused. This way edited messages do not trigger full prompt re-evaluation; that's just the way it's done in Ollama.
  2. Your code doesn't do what you claim: it doesn't match the full string, it requires a >60% match to use the cache, which is plain meaningless, bad, and makes things worse, as it drops useful cache. It also doesn't follow your own explanation of matching "the whole string", which in turn would be bad too, due to 1. (See the toy check after this list.)
  3. The way it is now, it may overwrite existing cache with parallel requests, but that's a known thing and I spoke about it months ago on Discord. The solution for that should be different, namely named sessions: storing the old cache under a certain identifier with a timeout timer and reusing it upon the key sent via the API.
  4. All of this is totally unrelated to my issue, so it should not be discussed here.
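
A toy check of the thresholding objected to in point 2 (the numbers are made up): requiring the match to exceed 60% of the new prompt throws away a prefix that is still a large absolute cache hit.

```python
# Toy version of a ">60% of the prompt" acceptance rule (illustration only).
def accept(match_tokens: int, prompt_tokens: int, threshold: float = 0.6) -> bool:
    return match_tokens > threshold * prompt_tokens

print(accept(2000, 4000))  # False: 2000 cached tokens get re-evaluated anyway
print(accept(2000, 3000))  # True: the same prefix passes for a shorter prompt
```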

@jessegross commented on GitHub (Oct 16, 2024):

I think there are at least two separate issues here:

  • Yes, when a message history exceeds the context length, the history gets truncated and shifted. This breaks the prompt cache and causes re-evaluation. It may be possible to fix this, but I haven't had a chance to think it through.
  • There are definitely issues with multiple conversations overwriting each other's caches. There is an improved implementation that you can try out if you build from source; you need to follow these [instructions](https://github.com/ollama/ollama/blob/main/docs/development.md#transition-to-go-runner). However, it is not enabled by default, as it causes a slowdown in single-conversation processing. You can set the environment variable OLLAMA_MULTIUSER_CACHE=1 to try it (a minimal launch sketch follows below), but note that this is purely for testing at this point and is subject to change. Hopefully we can get the best of both worlds in the future.
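
For anyone who wants to try that flag, a minimal launch sketch (assuming a source build that includes the new runner, per the comment above):

```python
# Launch the server with the experimental multi-user cache enabled
# (requires a source build with the new runner, per the comment above).
import os
import subprocess

env = dict(os.environ, OLLAMA_MULTIUSER_CACHE="1")
subprocess.run(["ollama", "serve"], env=env)
```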

@chrisoutwright commented on GitHub (Nov 4, 2024):

I also observed this, and have set these in my environment:
MaxLoadedModels = "1"
NumParallel = "1"

llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 1
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 72.71 B
llm_load_print_meta: model size       = 44.15 GiB (5.22 BPW)
llm_load_print_meta: general.name     = Qwen2.5 72B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab only - skipping tensors
[GIN] 2024/11/04 - 21:54:31 | 200 |   15.2285805s |             ::1 | POST     "/api/chat"
[GIN] 2024/11/04 - 21:55:22 | 200 |   34.3895991s |             ::1 | POST     "/api/chat"
time=2024-11-04T21:55:22.936+01:00 level=WARN source=runner.go:122 msg="truncating input prompt" limit=2400 prompt=6125 numKeep=4
[GIN] 2024/11/04 - 21:55:42 | 200 |   28.6406763s |             ::1 | POST     "/api/chat"
[GIN] 2024/11/04 - 21:57:12 | 200 |         1m24s |             ::1 | POST     "/api/chat"
[GIN] 2024/11/04 - 21:58:42 | 200 |         1m19s |             ::1 | POST     "/api/chat"

After trimming (it depends), the second follow-up noticeably slows down to about half speed before token generation starts. Is this normal?

What is this numKeep=4 about?

Will this act like a sliding text window that needs to re-evaluate each time (since the beginning starts within a different window each time)? I would rather have the option to keep the last fitting window fixed and discard it once exceeded, if each follow-up is going to be this slow.

@drazdra commented on GitHub (Nov 6, 2024):

> After trimming (it depends), the second follow-up noticeably slows down to about half speed before token generation starts. Is this normal?

There are two stages: prompt evaluation and inference. The first converts the prompt into vector space and stores it in a cache. The second uses that as a starting point to generate the completion; inference is what produces the actual replies.

On the next prompt, the shared part (the history) is taken from the cache, and only the new part is converted to vector space and concatenated onto the cache. That's why it's fast.

The cache has a size limit matching the context window size (Ollama multiplies it by the number of concurrent workers), which is num_ctx. When that limit is hit, the start of the prompt (the history) is thrown away.

Cache matching is implemented by matching from the start, so when the start of the history changes (thrown away because it no longer fits in the cache), the cache can't be matched and the whole prompt is re-converted into vectors again, which is very slow.

From then on, most subsequent prompts are going to be slow, as each one shifts the start of the history again to make room for the new messages you and the AI add.

To fix this, the cache matching mechanism would need to change.
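
A schematic of the two stages and the prefix matching described above, with toy token IDs (this illustrates the behavior, not the real runner code):

```python
# Toy prefix cache: evaluation reuses the longest matching prefix and only
# (re)computes the tail; changing the *start* of the prompt loses everything.
cache: list[int] = []

def evaluate(prompt: list[int]) -> int:
    """Return how many prompt tokens actually had to be evaluated."""
    global cache
    n = 0
    while n < min(len(cache), len(prompt)) and cache[n] == prompt[n]:
        n += 1
    cache = prompt[:]  # tokens from position n onward are (re)computed
    return len(prompt) - n

print(evaluate([1, 2, 3]))           # 3: cold cache, everything evaluated
print(evaluate([1, 2, 3, 4, 5]))     # 2: only the new tail is evaluated
print(evaluate([9, 1, 2, 3, 4, 5]))  # 6: the start changed, full re-eval
```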

> What is this numKeep=4 about?

It's the number of tokens after your system prompt that are always kept when the start of the history is thrown away.

@kripper commented on GitHub (Oct 2, 2025):

Same problem here when using OpenHands with Ollama. Somehow the cache is not being used at all. The initial prompt is over 15,000 tokens; the second multi-turn message can be only 20 tokens, yet it takes longer. This makes Ollama unusable :-(

When trying Open WebUI, the cache worked fine, but for one model I noticed that the second message was not using the cache while the next ones were. So this is not exactly random, but there is definitely a caching issue present in the latest Ollama version.

I guess other users are not reporting this issue because it requires a deeper understanding of how everything works, so they just assume "Ollama is slow, vLLM et al. are faster".

Please fix.

@kripper commented on GitHub (Oct 4, 2025):

I'm analyzing my slow performance issue here: https://github.com/ollama/ollama/issues/12477

Reference: github-starred/ollama#49837