[GH-ISSUE #2068] prompt_eval_count disappears after repeated requests with same prompt #47709

Closed
opened 2026-04-28 05:00:33 -05:00 by GiteaMirror · 9 comments

Originally created by @puffo on GitHub (Jan 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2068

I noticed some odd behaviour when working with ollama (via litellm, as I have been trying to fix a bug in the integration over there).

The `prompt_eval_count` parameter disappears from the response on a repeated request, yet `prompt_eval_duration` (and other metrics) are still in the response payload. This happens for `stream: true` and `stream: false` variants across multiple models.

For example, a basic request like this, run twice in a row:

```
curl -X POST http://0.0.0.0:11434/api/chat -d '{"model": "orca2", "messages": [{"role": "user", "content": "What is your purpose?"}], "stream": false}'
```

First response:

{"model":"orca2","created_at":"2024-01-19T00:01:42.266089Z","message":{"role":"assistant","content":"My purpose is to assist you with any information or tasks you need, using my knowledge and skills. I am an AI assistant created by Microsoft. Is there something I can help you with?"},"done":true,"total_duration":876557125,"load_duration":725667,"prompt_eval_count":11,"prompt_eval_duration":275489000,"eval_count":39,"eval_duration":595078000}

Subsequent responses:

{"model":"orca2","created_at":"2024-01-19T00:05:12.20112Z","message":{"role":"assistant","content":"Possible responses:\n\n- My purpose is to assist users with their questions and tasks, using natural language processing and artificial intelligence.\n- I don't have a fixed purpose, but I try to help you find information and solve problems that you ask me.\n- My purpose is to learn from you and improve my skills and knowledge by interacting with you."},"done":true,"total_duration":1384266083,"load_duration":2776833,"prompt_eval_duration":215464000,"eval_count":75,"eval_duration":1163950000}

Is this an intentional omission on any subsequent responses with the same prompt, or a bug?
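
For API consumers (litellm included), the practical consequence is that `prompt_eval_count` must currently be treated as optional. Here is a minimal defensive sketch in Go (hypothetical client code, not taken from Ollama or litellm), using a pointer field to distinguish a missing key from an explicit zero:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChatMetrics mirrors only the fields relevant here; the pointer lets
// us tell "key absent" apart from "key present with value 0".
type ChatMetrics struct {
	PromptEvalCount *int `json:"prompt_eval_count"`
	EvalCount       int  `json:"eval_count"`
}

func main() {
	// Abbreviated second-response payload: prompt_eval_count is missing.
	payload := []byte(`{"done":true,"prompt_eval_duration":215464000,"eval_count":75}`)

	var m ChatMetrics
	if err := json.Unmarshal(payload, &m); err != nil {
		panic(err)
	}

	if m.PromptEvalCount == nil {
		// Treat a missing count as a prompt-cache hit rather than an error.
		fmt.Println("prompt_eval_count missing; assuming 0 (prompt cache hit)")
	} else {
		fmt.Println("prompt_eval_count:", *m.PromptEvalCount)
	}
}
```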

GiteaMirror added the bug label 2026-04-28 05:00:33 -05:00

@mxyng commented on GitHub (Jan 19, 2024):

On initial investigation, this appears to be a bug, though it's unclear whether it's a bug in Ollama or llama.cpp.

I can reproduce this on Linux:

```
$ curl -X POST http://0.0.0.0:11434/api/chat -d '{"model": "orca2", "messages": [{"role": "user", "content": "What is your purpose?"}], "stream": false}'
{"model":"orca2","created_at":"2024-01-19T17:59:53.611509158Z","message":{"role":"assistant","content":"My purpose is to assist users with their questions, tasks, or problems by generating relevant and accurate responses from a large database of knowledge. I also try to learn from feedback and improve my skills over time."},"done":true,"total_duration":6850254927,"load_duration":748039916,"prompt_eval_count":67,"prompt_eval_duration":2633972000,"eval_count":41,"eval_duration":3462910000}
$ curl -X POST http://0.0.0.0:11434/api/chat -d '{"model": "orca2", "messages": [{"role": "user", "content": "What is your purpose?"}], "stream": false}'
{"model":"orca2","created_at":"2024-01-19T18:00:03.775694877Z","message":{"role":"assistant","content":"Possible responses:\n\n- My purpose is to assist users by providing information, answering questions, and generating text based on their input.\n- I do not have a fixed or inherent purpose. I only act according to the instructions I receive from you or follow the rules of the AI systems I am part of.\n- My purpose is to learn from you and improve my skills by interacting with you and other users in various domains and tasks."},"done":true,"total_duration":8413624523,"load_duration":161467,"prompt_eval_duration":191163000,"eval_count":93,"eval_duration":8218277000}
```

But not macOS:

```
$ curl -X POST http://0.0.0.0:11434/api/chat -d '{"model": "orca2", "messages": [{"role": "user", "content": "What is your purpose?"}], "stream": false}'
{"model":"orca2","created_at":"2024-01-19T17:58:45.096384Z","message":{"role":"assistant","content":"I am an AI assistant that helps people find information. I use natural language processing and web search to answer questions or perform tasks. What can I help you with?"},"done":true,"total_duration":1192084708,"load_duration":548464333,"prompt_eval_count":67,"prompt_eval_duration":137630000,"eval_count":35,"eval_duration":505565000}
$ curl -X POST http://0.0.0.0:11434/api/chat -d '{"model": "orca2", "messages": [{"role": "user", "content": "What is your purpose?"}], "stream": false}'
{"model":"orca2","created_at":"2024-01-19T17:58:48.423087Z","message":{"role":"assistant","content":"I am an AI assistant that helps people find information. I can answer questions, search the web, and provide feedback. My purpose is to assist you with your queries and make your life easier."},"done":true,"total_duration":864947750,"load_duration":456625,"prompt_eval_count":67,"prompt_eval_duration":270743000,"eval_count":41,"eval_duration":593412000}
```
<!-- gh-comment-id:1900854353 --> @mxyng commented on GitHub (Jan 19, 2024): On initial investigation, this appears to be a bug though it's unclear if it's a bug in Ollama or llama.cpp. I can reproduce this on Linux: ``` $ curl -X POST http://0.0.0.0:11434/api/chat -d '{"model": "orca2", "messages": [{"role": "user", "content": "What is your purpose?"}], "stream": false}' {"model":"orca2","created_at":"2024-01-19T17:59:53.611509158Z","message":{"role":"assistant","content":"My purpose is to assist users with their questions, tasks, or problems by generating relevant and accurate responses from a large database of knowledge. I also try to learn from feedback and improve my skills over time."},"done":true,"total_duration":6850254927,"load_duration":748039916,"prompt_eval_count":67,"prompt_eval_duration":2633972000,"eval_count":41,"eval_duration":3462910000} $ curl -X POST http://0.0.0.0:11434/api/chat -d '{"model": "orca2", "messages": [{"role": "user", "content": "What is your purpose?"}], "stream": false}' {"model":"orca2","created_at":"2024-01-19T18:00:03.775694877Z","message":{"role":"assistant","content":"Possible responses:\n\n- My purpose is to assist users by providing information, answering questions, and generating text based on their input.\n- I do not have a fixed or inherent purpose. I only act according to the instructions I receive from you or follow the rules of the AI systems I am part of.\n- My purpose is to learn from you and improve my skills by interacting with you and other users in various domains and tasks."},"done":true,"total_duration":8413624523,"load_duration":161467,"prompt_eval_duration":191163000,"eval_count":93,"eval_duration":8218277000}mike@orac:~$ ``` But not macOS: ``` $ curl -X POST http://0.0.0.0:11434/api/chat -d '{"model": "orca2", "messages": [{"role": "user", "content": "What is your purpose?"}], "stream": false}' {"model":"orca2","created_at":"2024-01-19T17:58:45.096384Z","message":{"role":"assistant","content":"I am an AI assistant that helps people find information. I use natural language processing and web search to answer questions or perform tasks. What can I help you with?"},"done":true,"total_duration":1192084708,"load_duration":548464333,"prompt_eval_count":67,"prompt_eval_duration":137630000,"eval_count":35,"eval_duration":505565000} $ curl -X POST http://0.0.0.0:11434/api/chat -d '{"model": "orca2", "messages": [{"role": "user", "content": "What is your purpose?"}], "stream": false}' {"model":"orca2","created_at":"2024-01-19T17:58:48.423087Z","message":{"role":"assistant","content":"I am an AI assistant that helps people find information. I can answer questions, search the web, and provide feedback. My purpose is to assist you with your queries and make your life easier."},"done":true,"total_duration":864947750,"load_duration":456625,"prompt_eval_count":67,"prompt_eval_duration":270743000,"eval_count":41,"eval_duration":593412000} ```

@puffo commented on GitHub (Jan 19, 2024):

As of a few days ago (~5), it also seemed to be reliably reproducible on macOS.

I do see some recent commits that might have inadvertently fixed the problem; perhaps switching off `cache_prompt` in https://github.com/jmorganca/ollama/pull/2018 is related?

<!-- gh-comment-id:1900953612 --> @puffo commented on GitHub (Jan 19, 2024): As of a few days ago (~5), it seemed to also be reliably reproducible on macOS. I do see some recent commits which might have inadvertently fixed the problem, perhaps switching off `cache_prompt` in https://github.com/jmorganca/ollama/pull/2018 is related?

@easp commented on GitHub (Jan 19, 2024):

I saw the same behavior on macOS once prompt caching was enabled. I assumed it was only showing the delta of prompt tokens that had to be processed between iterations.

<!-- gh-comment-id:1901012900 --> @easp commented on GitHub (Jan 19, 2024): I saw the same behavior on MacOS once the prompt caching was enabled. I assumed that it was only showing the delta of prompt tokens that had to be processed between iterations.

@julian-di commented on GitHub (Jan 25, 2024):

This is expected, since the prompt is cached on subsequent requests; see: https://github.com/ollama/ollama/pull/1642
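
For readers unfamiliar with the mechanism: as described around that PR, a repeated request reuses the KV cache for the shared prompt prefix, so only tokens past the longest common prefix need to be evaluated again. A rough sketch of that bookkeeping in Go (illustrative only, not Ollama's or llama.cpp's actual code):

```go
package main

import "fmt"

// commonPrefixLen returns how many leading tokens two token sequences
// share; a prompt cache only needs to evaluate tokens past this point.
func commonPrefixLen(cached, prompt []int) int {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n] == prompt[n] {
		n++
	}
	return n
}

func main() {
	cached := []int{1, 4200, 318, 534, 4007, 30} // first request's prompt tokens (made-up IDs)
	prompt := []int{1, 4200, 318, 534, 4007, 30} // identical repeat request

	// Only tokens beyond the shared prefix are evaluated this request;
	// for an identical prompt that count is 0, matching the vanished field.
	fmt.Println("tokens to evaluate:", len(prompt)-commonPrefixLen(cached, prompt))
}
```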

<!-- gh-comment-id:1910408744 --> @julian-di commented on GitHub (Jan 25, 2024): This is expected, since the prompt is cached in subsequent requests see: https://github.com/ollama/ollama/pull/1642

@puffo commented on GitHub (Jan 25, 2024):

Thanks for pointing that out @julian-di.

I do, however, find the current behavior a bit surprising. I'd expect it to return the cached evals in addition to the cached prompt, if that makes sense and/or is even possible.

<!-- gh-comment-id:1910748260 --> @puffo commented on GitHub (Jan 25, 2024): Thanks for pointing that out @julian-di. I do however find the current behavior a bit surprising. I'd expect to have it return the cached evals in addition to the cached prompt, if that makes sense and/or is even possible?

@rozeappletree commented on GitHub (Feb 1, 2024):

What's the status of this issue? Is the disappearing `prompt_eval_count` expected behavior or not? If not, are there any possible fixes?

<!-- gh-comment-id:1921952833 --> @rozeappletree commented on GitHub (Feb 1, 2024): What's the status of this issue? Is the disappearing `prompt_eval_count` expected behavior or not. If not, any possible fixes?

@gregnwosu commented on GitHub (Apr 8, 2024):

How does one disable prompt caching?

<!-- gh-comment-id:2043008482 --> @gregnwosu commented on GitHub (Apr 8, 2024): how does one disable prompt_caching

@flu0r1ne commented on GitHub (Jun 10, 2024):

If `prompt_eval_count` represents the number of tokens evaluated during a particular request, it seems to contradict the documentation provided:

https://github.com/ollama/ollama/blob/943172cbf4d6cc0b8682021bfc9c2d816152615d/docs/api.md?plain=1#L88-L94

If this field is omitted when caching is enabled, this behavior should be documented. However, it might be more user-friendly to always include the field in the response, possibly with a value of `0`. This would eliminate the need for API consumers to check for the conditional existence of the field.
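
The omission pattern is consistent with Go's `omitempty` JSON tag, which drops zero-valued fields at marshal time: on a prompt-cache hit the evaluated-token count is zero, so the key vanishes. A minimal sketch of that mechanism (field names are assumed to resemble Ollama's API types; this is illustrative, not the actual source):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metrics sketches a response struct with omitempty tags; Go's
// encoding/json silently drops zero-valued fields tagged omitempty.
type Metrics struct {
	PromptEvalCount    int   `json:"prompt_eval_count,omitempty"`
	PromptEvalDuration int64 `json:"prompt_eval_duration,omitempty"`
}

func main() {
	// Full prompt-cache hit: zero prompt tokens evaluated this request.
	b, _ := json.Marshal(Metrics{PromptEvalCount: 0, PromptEvalDuration: 215464000})
	fmt.Println(string(b)) // {"prompt_eval_duration":215464000}

	// Removing omitempty (or using *int) would keep the field in the
	// payload, e.g. "prompt_eval_count":0, as suggested above.
}
```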

<!-- gh-comment-id:2159187298 --> @flu0r1ne commented on GitHub (Jun 10, 2024): If `prompt_eval_count` represents the number of tokens evaluated during a particular request, it seems to contradict the documentation provided: https://github.com/ollama/ollama/blob/943172cbf4d6cc0b8682021bfc9c2d816152615d/docs/api.md?plain=1#L88-L94 If this field is omitted when caching is enabled, this behavior should be documented. However, it might be more user-friendly to always include the field in the response, possibly with a value of `0`. This would eliminate the need for API consumers to check for the conditional existence of the field.

@ysolanky commented on GitHub (Jun 25, 2024):

Hello! Do we have a fix for `prompt_eval_count` not showing up? Is the reason prompt caching, and can we disable it?

<!-- gh-comment-id:2190033678 --> @ysolanky commented on GitHub (Jun 25, 2024): Hello! Do we have a fix for `prompt_eval_count` not showing up? Is the reason prompt caching and can we disable it?
Reference: github-starred/ollama#47709