[GH-ISSUE #14683] Feature Request: Include usage metrics in streaming chat responses #56015

Closed
opened 2026-04-29 10:08:44 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @fuleinist on GitHub (Mar 7, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14683

Problem Description

Currently, the OpenAPI spec defines ChatStreamEvent without the usage/metrics fields that are available in the non-streaming ChatResponse. This creates an inconsistency where developers cannot access important performance metrics (token counts, timing information) during streaming responses.

Proposed Solution

Add the following fields to ChatStreamEvent in the OpenAPI spec to match ChatResponse:

  • total_duration - Total request duration in nanoseconds
  • load_duration - Model loading duration in nanoseconds
  • prompt_eval_count - Number of tokens in the prompt
  • prompt_eval_duration - Time spent evaluating the prompt in nanoseconds
  • eval_count - Number of tokens in the response
  • eval_duration - Time spent generating the response in nanoseconds

Additionally, add logprobs to ChatStreamEvent since it's present in GenerateStreamEvent but missing from the chat streaming spec.
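To make the proposal concrete, here is a minimal sketch of what a final streaming chunk carrying the proposed fields could look like. The payload below is illustrative (the field names come from the list above; the values and the model name are made up, not real server output):

```python
import json

# Hypothetical final chunk of a streaming /api/chat response, showing the
# proposed metrics fields. Values are made up for illustration.
final_chunk = json.loads("""{
  "model": "llama3",
  "done": true,
  "total_duration": 5000000000,
  "load_duration": 500000000,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 400000000,
  "eval_count": 112,
  "eval_duration": 4000000000
}""")

# All durations are in nanoseconds, matching the non-streaming ChatResponse.
assert final_chunk["done"]
print(final_chunk["eval_count"])
```

Since the Go server already populates these fields on the terminal response, the spec change would simply document what is already on the wire.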

Use Case

Developers need real-time token usage and timing data during streaming to:

  • Display token counts to users in real-time
  • Monitor model performance and latency
  • Implement rate limiting based on actual usage
  • Debug performance issues
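For instance, generation throughput can be derived directly from eval_count and eval_duration once they are exposed during streaming. A sketch with made-up numbers:

```python
# eval_duration is reported in nanoseconds; convert to seconds for tokens/sec.
# The values below are illustrative, not real server output.
eval_count = 112                # tokens generated
eval_duration = 4_000_000_000   # 4 seconds, in nanoseconds

tokens_per_second = eval_count / (eval_duration / 1e9)
print(f"{tokens_per_second:.1f} tok/s")  # 28.0 tok/s
```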

Implementation Suggestion

The Go implementation already uses a single ChatResponse struct for both streaming and non-streaming responses (see api/types.go). The fix primarily requires updating the OpenAPI spec in docs/api.md to include the metrics fields in ChatStreamEvent.

Reference: A similar pattern is already implemented for GenerateStreamEvent which correctly includes these metrics.
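A rough sketch of how the fields might be declared on the ChatStreamEvent schema. This fragment is illustrative only; the actual schema layout and naming conventions in docs/api.md may differ, and only three of the six proposed fields are shown:

```yaml
# Illustrative OpenAPI fragment; not copied from docs/api.md.
ChatStreamEvent:
  type: object
  properties:
    # ...existing streaming fields...
    total_duration:
      type: integer
      format: int64
      description: Total request duration in nanoseconds (final event only)
    eval_count:
      type: integer
      description: Number of tokens in the response (final event only)
    eval_duration:
      type: integer
      format: int64
      description: Time spent generating the response, in nanoseconds (final event only)
```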

Additional Context

This would also resolve the inconsistency noted in issue #14680 regarding the mismatch between streaming and non-streaming response schemas.

GiteaMirror added the documentation label 2026-04-29 10:08:44 -05:00

@rick-github commented on GitHub (Mar 7, 2026):

The final response in a chat stream contains the usage metrics.
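This behavior can be illustrated with a minimal consumer that reads the metrics off the terminal chunk. The NDJSON lines below are a hard-coded stand-in for a real streamed HTTP response body, and the values are invented for illustration:

```python
import json

# Stand-in for the NDJSON lines of a streaming /api/chat response;
# only the last chunk (done: true) carries the usage metrics.
stream_lines = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true,'
    ' "prompt_eval_count": 26, "eval_count": 112, "eval_duration": 4000000000}',
]

text, metrics = "", {}
for line in stream_lines:
    chunk = json.loads(line)
    text += chunk["message"]["content"]
    if chunk["done"]:
        # Collect whichever metrics fields the final chunk provides.
        metrics = {k: v for k, v in chunk.items()
                   if k in ("total_duration", "load_duration",
                            "prompt_eval_count", "prompt_eval_duration",
                            "eval_count", "eval_duration")}

print(text)                   # Hello!
print(metrics["eval_count"])  # 112
```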


Reference: github-starred/ollama#56015