[GH-ISSUE #10860] Total duration / overhead is (much) longer than the sum of the other durations #69193

Open
opened 2026-05-04 17:27:03 -05:00 by GiteaMirror · 3 comments

Originally created by @marcnaweb on GitHub (May 26, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10860

I am implementing a service to analyze big documents.

When querying the Ollama server I get a difference (of more than 5 seconds) between the "total_duration" and the sum of load_duration + prompt_eval_duration + eval_duration, as seen below:

```
{ (...)
  "total_duration": 8139553772,       # 8.13 seconds
  "load_duration": 44643610,          # 0.04 seconds
  "prompt_eval_count": 63859,
  "prompt_eval_duration": 205727569,  # 0.20 seconds
  "eval_count": 82,
  "eval_duration": 2676314808 }       # 2.67 seconds
```

I am using the Ollama "generate" endpoint of my Ollama Docker container with a long token array in "context", which is visible in the 63,859 tokens reported in "prompt_eval_count".

I already checked the network bandwidth and it does not explain this difference: I am on localhost and the total payload is less than 400 KB.

The difference of around 5 seconds between the sum of the durations and the "total_duration" is consistent across all cases in which I send a payload with a big "context" value, and it does not appear if I leave out the "context" (the tokenized context).

Here is my payload code:

```python
request_payload = {
    "model": "llama3.1",
    "prompt": "Who is the author of this document?",
    "context": contextualization_response["context"],
    "stream": False,
    "options": {
        "temperature": 0.5,
        "num_ctx": 70000,
    },
}
```
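
For reference, this is roughly how I send it and compare client-side wall time against the reported total_duration; the default local Ollama URL and the 600-second timeout are just my assumptions here:

```python
import time

import requests  # assumes the requests library is installed

# Hypothetical sketch: send the payload defined above and compare client-side
# wall time with the server-reported total_duration, to separate network and
# client-side cost from whatever happens inside the server.
start = time.monotonic()
resp = requests.post(
    "http://localhost:11434/api/generate",
    json=request_payload,
    timeout=600,  # generous timeout for long prompts (my assumption)
)
wall_s = time.monotonic() - start

data = resp.json()
print(f"wall clock: {wall_s:.2f} s, total_duration: {data['total_duration'] / 1e9:.2f} s")
```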

Question: what might be causing this duration difference/overhead? Is there a way to reduce it?

While investigating this issue, I found that it might be improved by better handling of the JSON in the Go server, but I am not sure that is the case here.
I also found an issue with similar(?) behavior, although in my tests the overhead also happened with curl requests, which suggests a different cause: https://github.com/ollama/ollama/issues/6353
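
To get a rough feel for how much the tokenized "context" weighs as JSON on the client side (the Go server's decoding cost may of course behave differently), here is a small estimate with a placeholder array of the same length as mine; the token values are random and purely illustrative:

```python
import json
import random
import time

# Rough client-side estimate only: serialize and parse a token array of the
# same size as my "context" to see how large the JSON itself is and how long
# parsing it takes in Python.
context = [random.randint(0, 128_000) for _ in range(63_859)]

payload = json.dumps({"context": context})
print(f"serialized size: {len(payload) / 1024:.0f} KiB")

start = time.monotonic()
json.loads(payload)
print(f"client-side parse time: {(time.monotonic() - start) * 1000:.1f} ms")
```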


@vorner commented on GitHub (Jan 25, 2026):

Hello

I'm observing a similar thing, but in a much more severe case, it seems. I'm using the chat endpoint with a program that exposes tools to gpt-oss. As the conversation grows longer and longer, the issue grows and overall progress slows down significantly.

Here are stats from one of the rounds (the extra fields are computed in the obvious way):

```
Stats { total_duration: 336.629429994s, load_duration: 284.580431ms,
        prompt_eval_count: 40456, prompt_eval_duration: 3.049798077s,
        eval_count: 74, eval_duration: 32.813138079s, total_tokens: 40530,
        input_tps: "13265.14", output_tps: "2.26", total_tps: "120.40" }
```

As you can see, there is something like 5 minutes not accounted for in load, eval, or prompt_eval (roughly ten times the sum of those three).
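
For reference, a quick sketch of how the derived fields and the missing time fall out of the numbers above (durations converted to seconds):

```python
# Derived fields and unaccounted time from the stats above.
total_duration = 336.629429994
load_duration = 0.284580431
prompt_eval_count = 40456
prompt_eval_duration = 3.049798077
eval_count = 74
eval_duration = 32.813138079

total_tokens = prompt_eval_count + eval_count             # 40530
input_tps = prompt_eval_count / prompt_eval_duration      # ~13265
output_tps = eval_count / eval_duration                   # ~2.26
total_tps = total_tokens / total_duration                 # ~120.4

unaccounted = total_duration - (load_duration + prompt_eval_duration + eval_duration)
print(f"unaccounted: {unaccounted:.0f} s")  # ~300 s, i.e. roughly five minutes
```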

Further (indirect) observations:

  • These 5 minutes happen on the CPU. The stats above are from a model that does not fit entirely onto the GPU (gpt-oss:120b), but with a model that does fit (gpt-oss:20b) it still happens, and still on the CPU only (in that case the ratio between this preparation phase and the actual query is even bigger).
  • Not all available CPU threads are utilized during this time: there are 32 cores and the model's `num_thread` is 16, yet it uses only about 7.5 cores.
  • If the client drops the connection during this time, the server keeps processing the request (it still consumes CPU); once it reaches the stage where it actually processes something, it stops.
  • While the model is "stuck" in this phase, other requests from other clients for the same model wait until it is done, despite OLLAMA_NUM_PARALLEL=4.

Any idea how I find out more about what's happening during this time?


@vorner commented on GitHub (Jan 29, 2026):

A bit more info, in case someone is more knowledgeable about the pipeline. These are some logs from when it happens, with a mark where the long wait occurs (it's only about a minute in this case, but I've seen it grow to close to an hour on some occasions).

```
time=2026-01-29T10:59:14.878+01:00 level=DEBUG source=sched.go:678 msg="evaluating already loaded" model=/home/vorner/.ollama/models/blobs/sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
====================== The wait happens here ======================
time=2026-01-29T11:00:31.224+01:00 level=DEBUG source=server.go:1533 msg="completion request" images=0 prompt=89502 format=""
time=2026-01-29T11:00:32.830+01:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=22830 prompt=22641 used=22526 remaining=115
```

@vorner commented on GitHub (Jan 29, 2026):

Another interesting observation: the time/CPU is spent in `ollama serve`, not in `ollama runner`.

Reference: github-starred/ollama#69193