[GH-ISSUE #1365] llama_print_timings have disappeared from the logs. #62753

Closed
opened 2026-05-03 10:11:01 -05:00 by GiteaMirror · 3 comments

Originally created by @madsamjp on GitHub (Dec 3, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1365

In a previous version of Ollama, following the logs (on Linux using `journalctl -t ollama -f`) would give helpful information after the model has finished with its response (such as tokens per second).

e.g. this:

```
Dec 03 14:58:42 osm-server ollama[20658]: llama server listening at http://127.0.0.1:54457
Dec 03 14:58:42 osm-server ollama[20658]: {"timestamp":1701615522,"level":"INFO","function":"main","line":1746,"message":"HTTP server listening","hostname":"127.0.0.1","port":54457}
Dec 03 14:58:42 osm-server ollama[20658]: {"timestamp":1701615522,"level":"INFO","function":"log_server_request","line":1233,"message":"request","remote_addr":"127.0.0.1","remote_port":51344,"statu>
Dec 03 14:58:42 osm-server ollama[937]: 2023/12/03 14:58:42 llama.go:492: llama runner started in 9.200880 seconds
Dec 03 14:58:50 osm-server ollama[20658]: {"timestamp":1701615530,"level":"INFO","function":"log_server_request","line":1233,"message":"request","remote_addr":"127.0.0.1","remote_port":51344,"statu>
Dec 03 14:58:50 osm-server ollama[937]: llama_print_timings:        load time =    8317.76 ms
Dec 03 14:58:50 osm-server ollama[937]: llama_print_timings:      sample time =     107.35 ms /   396 runs   (    0.27 ms per token,  3688.73 tokens per second)
Dec 03 14:58:50 osm-server ollama[937]: llama_print_timings: prompt eval time =     444.18 ms /   800 tokens (    0.56 ms per token,  1801.06 tokens per second)
Dec 03 14:58:50 osm-server ollama[937]: llama_print_timings:        eval time =    6696.50 ms /   395 runs   (   16.95 ms per token,    58.99 tokens per second)
Dec 03 14:58:50 osm-server ollama[937]: llama_print_timings:       total time =    7335.31 ms
```

This was really handy, but since updating Ollama, I've noticed this helpful info has gone. Is there an environment variable I can set to get it back?


@rankun203 commented on GitHub (Dec 5, 2023):

As a workaround, you can use `ollama run --verbose <MODEL_NAME> ...`

Detailed usage of `ollama run`:

```bash
$ ollama run --help
Run a model

Usage:
  ollama run MODEL [PROMPT] [flags]

Flags:
      --format string   Response format (e.g. json)
  -h, --help            help for run
      --insecure        Use an insecure registry
      --nowordwrap      Don't wrap words to the next line automatically
      --verbose         Show timings for response
```

Example response:

```
... LLM output

total duration:       5.33640075s
load duration:        531.536292ms
prompt eval count:    27 token(s)
prompt eval duration: 110.818ms
prompt eval rate:     243.64 tokens/s
eval count:           270 token(s)
eval duration:        4.689066s
eval rate:            57.58 tokens/s
```

@madsamjp commented on GitHub (Dec 9, 2023):

> As a workaround, you can use `ollama run --verbose <MODEL_NAME> ...`
>
> Detailed usage of `ollama run`:
>
> ```shell
> $ ollama run --help
> Run a model
>
> Usage:
>   ollama run MODEL [PROMPT] [flags]
>
> Flags:
>       --format string   Response format (e.g. json)
>   -h, --help            help for run
>       --insecure        Use an insecure registry
>       --nowordwrap      Don't wrap words to the next line automatically
>       --verbose         Show timings for response
> ```
>
> Example response:
>
> ```shell
> ... LLM output
>
> total duration:       5.33640075s
> load duration:        531.536292ms
> prompt eval count:    27 token(s)
> prompt eval duration: 110.818ms
> prompt eval rate:     243.64 tokens/s
> eval count:           270 token(s)
> eval duration:        4.689066s
> eval rate:            57.58 tokens/s
> ```

This would only be useful when running a model in the terminal. If I run the models externally through the API, is there a way to get response timings?


@mxyng commented on GitHub (Jan 20, 2024):

The terminal metrics are based on [generate](https://github.com/jmorganca/ollama/blob/main/docs/api.md#generate-a-completion) or [chat](https://github.com/jmorganca/ollama/blob/main/docs/api.md#generate-a-chat-completion) responses. Responses will contain these fields:

```
total_duration: time spent generating the response
load_duration: time spent in nanoseconds loading the model
prompt_eval_count: number of tokens in the prompt
prompt_eval_duration: time spent in nanoseconds evaluating the prompt
eval_count: number of tokens in the response
eval_duration: time in nanoseconds spent generating the response
```
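For instance, a non-streaming request to `/api/generate` returns these fields in its final JSON object, and a tokens-per-second figure can be derived from `eval_count` and `eval_duration` (durations are in nanoseconds). A minimal sketch, assuming Ollama is listening on its default `localhost:11434`, `jq` is installed, and the model name `llama2` is purely illustrative:

```bash
# Request a completion with streaming disabled so the timing fields are
# returned in a single JSON object, then compute tokens/s with jq.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq '{eval_count, eval_duration,
          eval_rate_tps: (.eval_count / (.eval_duration / 1e9))}'
```

The same arithmetic (token count divided by duration in seconds) reproduces the eval rate line that `ollama run --verbose` prints in the terminal.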
Reference: github-starred/ollama#62753