[GH-ISSUE #13914] RFC: Streaming prompt evaluation progress #71163

Open
opened 2026-05-05 00:35:05 -05:00 by GiteaMirror · 2 comments

Originally created by @balisujohn on GitHub (Jan 26, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13914

RFC: Streaming prompt evaluation progress

Summary

This PR adds an optional prompt_eval_progress parameter to the /api/generate and /api/chat endpoints. When set to N, the server streams progress updates every N tokens during prompt evaluation, letting clients show loading indicators for long prompts.

Note: Progress updates are requested every prompt_eval_progress tokens, but the batch size acts as a lower bound on the update interval, since updates can only be sent between batches. In the example below, the batch size is 512, so updates appear at 512-token intervals despite prompt_eval_progress: 100.
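
As a concrete sketch of that batching interaction (variable names here are illustrative, not taken from the draft PR): the check can only run at batch boundaries, so the effective update interval is max(prompt_eval_progress, batch size):

package main

import "fmt"

// Toy model of the emit decision: progress can only be reported between
// decoded batches, so with a 512-token batch a requested interval of 100
// still yields updates at 512, 1024, 1536, ...
func main() {
	const (
		total              = 2010 // prompt tokens
		batchSize          = 512
		promptEvalProgress = 100 // requested update interval
	)
	lastSent := 0
	for processed := batchSize; processed < total; processed += batchSize {
		if processed-lastSent >= promptEvalProgress {
			fmt.Printf("progress: %d/%d\n", processed, total)
			lastSent = processed
		}
	}
}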

Motivation

For large context windows or long system prompts, prompt evaluation can take a while with no feedback to the user. This gives clients a way to show "processing 1500/3000 tokens..." instead of just a spinner.

Example

Request:

curl -N http://localhost:11434/api/generate -d '{
  "model": "gemma3:270m",
  "prompt": "<~2000 token prompt>",
  "prompt_eval_progress": 100,
  "stream": true
}'

Response:

{"model":"gemma3:270m","created_at":"...","response":"","done":false,"prompt_eval_completed":512,"prompt_eval_total":2010}
{"model":"gemma3:270m","created_at":"...","response":"","done":false,"prompt_eval_completed":1024,"prompt_eval_total":2010}
{"model":"gemma3:270m","created_at":"...","response":"","done":false,"prompt_eval_completed":1536,"prompt_eval_total":2010}
{"model":"gemma3:270m","created_at":"...","response":"The","done":false}
{"model":"gemma3:270m","created_at":"...","response":" quick","done":false}
...

Without prompt_eval_progress set, no progress updates are sent and the first response is the first generated token.
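
For illustration, a client can tell progress records apart from token records by checking the dedicated counters. A minimal Go consumer, assuming the field names proposed in this RFC and a local server at the default port:

package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chunk models only the subset of the streaming response used here;
// the prompt_eval_* fields are the ones proposed in this RFC.
type chunk struct {
	Response            string `json:"response"`
	Done                bool   `json:"done"`
	PromptEvalCompleted int    `json:"prompt_eval_completed"`
	PromptEvalTotal     int    `json:"prompt_eval_total"`
}

func main() {
	body := []byte(`{"model":"gemma3:270m","prompt":"...","prompt_eval_progress":100,"stream":true}`)
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The stream is newline-delimited JSON; decode one object per line.
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		var c chunk
		if err := json.Unmarshal(sc.Bytes(), &c); err != nil {
			continue
		}
		if c.PromptEvalTotal > 0 {
			// Progress record: empty response, counters populated.
			fmt.Printf("processing %d/%d tokens...\n", c.PromptEvalCompleted, c.PromptEvalTotal)
		} else {
			fmt.Print(c.Response) // generated token
		}
	}
}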

Design note: dedicated field names

The response uses prompt_eval_completed / prompt_eval_total rather than reusing the existing completed / total fields. This avoids ambiguity with image generation, where those fields already mean "diffusion steps completed/total".
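
A sketch of how the dedicated fields could be carried (this is not the actual api.GenerateResponse definition; omitempty is an assumption so that ordinary token chunks omit the counters):

package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical response struct: with omitempty, only explicit progress
// records serialize prompt_eval_completed / prompt_eval_total, leaving
// token chunks and the image-generation completed/total fields untouched.
type generateResponse struct {
	Model               string `json:"model"`
	Response            string `json:"response"`
	Done                bool   `json:"done"`
	PromptEvalCompleted int    `json:"prompt_eval_completed,omitempty"`
	PromptEvalTotal     int    `json:"prompt_eval_total,omitempty"`
}

func main() {
	progress := generateResponse{Model: "gemma3:270m", PromptEvalCompleted: 512, PromptEvalTotal: 2010}
	token := generateResponse{Model: "gemma3:270m", Response: "The"}
	for _, r := range []generateResponse{progress, token} {
		b, _ := json.Marshal(r)
		fmt.Println(string(b))
	}
}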

Open question: blocking vs non-blocking progress sends

The current implementation sends progress updates using a blocking channel send (with a seq.quit escape hatch), matching how token sends work:

select {
case seq.responses <- response{...}:
    seq.lastProgressSent = processed
case <-seq.quit:
    continue
}

An alternative would be non-blocking sends that skip updates if the buffer is full:

select {
case seq.responses <- response{...}:
    seq.lastProgressSent = processed
default:
    // buffer full, skip this update
}

Tradeoffs:

| | Blocking | Non-blocking |
|---|---|---|
| Guarantees delivery | ✓ | ✗ (may skip updates) |
| Can stall on slow consumer | ✓ | ✗ |
| Consistent with token sends | ✓ | ✗ |
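
For concreteness, a standalone toy contrasting the two send shapes against a full buffer (nothing here is from the draft PR; it only exercises the two select patterns above):

package main

import "fmt"

func main() {
	responses := make(chan int, 1) // capacity 1: full after one send
	quit := make(chan struct{})
	responses <- 0 // simulate a slow consumer leaving the buffer full

	// Non-blocking: the update is silently dropped while the buffer is full.
	select {
	case responses <- 1:
		fmt.Println("sent update")
	default:
		fmt.Println("buffer full, skipped update")
	}

	// Blocking: the send waits until the consumer drains the channel
	// (or quit fires), so delivery is guaranteed but can stall.
	done := make(chan struct{})
	go func() {
		fmt.Println("consumed:", <-responses)
		close(done)
	}()
	select {
	case responses <- 1:
		fmt.Println("sent update once the consumer drained the buffer")
	case <-quit:
	}
	<-done
}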

My draft implementation is here: https://github.com/ollama/ollama/pull/13901

Curious to hear everyone's thoughts on this.

GiteaMirror added the feature request label 2026-05-05 00:35:05 -05:00

@illusdolphin commented on GitHub (Jan 27, 2026):

Could this also be extended to report progress while loading the model into memory? For large models, loading can take up to a minute.
"model_load_progress": 1073741824 (every 1 GiB)
{"model":"gpt-oss:120b","created_at":"...","response":"","done":false,"model_load_completed":1073741824,"model_load_total":71940702208}


@balisujohn commented on GitHub (Jan 27, 2026):

Seems worth adding, maybe as a separate (but not mutually exclusive) optional key.
