[GH-ISSUE #11298] Why would max token limit affect prompt eval time? #69511

Closed
opened 2026-05-04 18:17:32 -05:00 by GiteaMirror · 1 comment

Originally created by @negaralizadeh on GitHub (Jul 4, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11298

I was running some experiments with codellama:7b-instruct-q4_K_M using the Python library (ollama.generate), and I noticed that changing the maximum-generated-tokens limit (num_predict) affects the prompt evaluation time (prompt_eval_duration), even though the input and all other options remain the same. In each round, I queried the model with 330 different queries. Here are the results (durations in nanoseconds):

| Metric               | 1024 Output Tokens | 1 Output Token  |
|:---------------------|-------------------:|----------------:|
| prompt_eval_count    | 190,760            | 190,760         |
| prompt_eval_duration | 122,118,342,000    | 84,885,213,000  |
| output_tokens        | 74,891             | 330             |
| eval_duration        | 2,107,317,930,000  | 4,357,000       |
| total_duration       | 2,265,230,752,632  | 121,807,018,567 |

As you can see, there is a ~30% drop in total prompt evaluation time, around a 113 ms drop per prompt ((122,118,342,000 − 84,885,213,000) ns / 330 ≈ 112.8 ms). I'd appreciate an explanation for that.
The model fits 100% on my GPU (RTX 3070) and my OS is Ubuntu 20.04.
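
For context, here is a minimal sketch of the kind of loop used to collect these numbers. The model name matches the runs above; the prompt list is a placeholder for the 330 real queries:

```python
import ollama

MODEL = "codellama:7b-instruct-q4_K_M"
PROMPTS = ["..."]  # placeholder: the real runs used 330 distinct queries

def run_round(num_predict: int) -> dict:
    """Sum ollama's timing counters (nanoseconds) over one round of queries."""
    totals = dict.fromkeys(
        ["prompt_eval_count", "prompt_eval_duration",
         "eval_count", "eval_duration", "total_duration"], 0)
    for prompt in PROMPTS:
        resp = ollama.generate(model=MODEL, prompt=prompt,
                               options={"num_predict": num_predict})
        for key in totals:
            # dict-style access works on both older (dict) and newer
            # (pydantic model) versions of the ollama Python client
            totals[key] += resp[key]
    return totals

print("num_predict=1024:", run_round(1024))
print("num_predict=1:   ", run_round(1))
```

Note that the API reports generated tokens as eval_count; the table above labels that column output_tokens.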
Thank you!


@rick-github commented on GitHub (Jul 4, 2025):

Part of prompt processing is finding and preparing a cache slot for inference. If a cache slot was previously used, its corresponding KV cache entries need to be cleared first. The more tokens the previous request generated, the more entries there are to clear, and the more time that takes, which is counted in prompt_eval_duration.
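
If that is the cause, it should be observable by keeping the measured request fixed and varying only how many tokens the previous request generated. A rough sketch of such a probe (the prompts are arbitrary placeholders, and exact slot-selection behavior depends on ollama's scheduler, so treat the output as indicative rather than a precise measurement):

```python
import ollama

MODEL = "codellama:7b-instruct-q4_K_M"

def probe(prev_num_predict: int) -> int:
    """Return prompt_eval_duration (ns) of a fixed request issued right
    after a generation of prev_num_predict tokens."""
    # Warm-up request: its generated tokens populate the slot's KV cache.
    ollama.generate(model=MODEL, prompt="Write a long story about a robot.",
                    options={"num_predict": prev_num_predict})
    # Measured request: identical every time, so any change in
    # prompt_eval_duration reflects the cost of preparing the slot
    # (clearing the previous request's KV cache entries).
    resp = ollama.generate(model=MODEL, prompt="What is 2 + 2?",
                           options={"num_predict": 1})
    return resp["prompt_eval_duration"]

for n in (1, 256, 1024):
    print(f"previous generation: {n:4d} tokens ->", probe(n), "ns")
```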

Reference: github-starred/ollama#69511