[GH-ISSUE #14028] What does "llama runner started in XXX seconds" mean in ollama logs? #34928

Closed
opened 2026-04-22 18:55:38 -05:00 by GiteaMirror · 7 comments

Originally created by @ajmal-yazdani on GitHub (Feb 2, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14028

We're running Ollama as a Docker image with 8 replicas, and continuous inferencing is going on via multiple concurrent calls from our client app to the /chat endpoint.

90% of the time inference runs with low latency, but every so often we suddenly start seeing the logs below (among others) in the Ollama pods:

```
2026-02-02 17:59:02.735
info
time=2026-02-02T12:29:02.735Z level=INFO source=server.go:1339 msg="llama runner started in 29.96 seconds"

2026-02-02 17:59:02.735
info
time=2026-02-02T12:29:02.735Z level=INFO source=server.go:1339 msg="llama runner started in 29.96 seconds"

2026-02-02 17:58:06.812
info
time=2026-02-02T12:28:06.812Z level=INFO source=server.go:1339 msg="llama runner started in 30.46 seconds"

2026-02-02 17:58:06.811
info
time=2026-02-02T12:28:06.811Z level=INFO source=server.go:1339 msg="llama runner started in 30.45 seconds"

2026-02-02 17:58:06.368
info
time=2026-02-02T12:28:06.368Z level=INFO source=server.go:1339 msg="llama runner started in 30.22 seconds"

2026-02-02 17:58:06.368
info
time=2026-02-02T12:28:06.367Z level=INFO source=server.go:1339 msg="llama runner started in 30.22 seconds"

2026-02-02 17:57:12.038
info
time=2026-02-02T12:27:12.038Z level=INFO source=server.go:1339 msg="llama runner started in 29.96 seconds"

2026-02-02 17:57:12.038
info
time=2026-02-02T12:27:12.038Z level=INFO source=server.go:1339 msg="llama runner started in 29.96 seconds"
```

During this time, API call latency is also 30+ seconds. We wanted to understand why this happens and what it means.

Is the model being loaded and unloaded from VRAM in between?

```
2026-02-02 17:58:32.637
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000

2026-02-02 17:58:32.637
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...

2026-02-02 17:58:32.611
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

2026-02-02 17:58:32.607
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...

2026-02-02 17:58:32.593
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe

2026-02-02 17:58:32.593
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2

2026-02-02 17:58:32.593
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128

2026-02-02 17:58:32.593
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256

2026-02-02 17:58:32.593
llama_model_loader: - kv 17: general.file_type u32 = 15

2026-02-02 17:58:32.593
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010

2026-02-02 17:58:32.593
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000

2026-02-02 17:58:32.593
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8

2026-02-02 17:58:32.593
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32

2026-02-02 17:58:32.593
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336

2026-02-02 17:58:32.593
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096

2026-02-02 17:58:32.593
llama_model_loader: - kv 10: llama.context_length u32 = 131072
```

GiteaMirror added the question label 2026-04-22 18:55:38 -05:00

@rick-github commented on GitHub (Feb 2, 2026):

`llama runner started in 29.96 seconds` means it took 30s to load the model onto the device doing inference, i.e. the model either hadn't been loaded yet, or had been loaded and then unloaded. Model evictions can happen due to a change in API parameters or the loading of a new model. Set `OLLAMA_DEBUG=1` in the server environment to get more information on model scheduling.


@ajmal-yazdani commented on GitHub (Feb 3, 2026):

[llama_logs.txt](https://github.com/user-attachments/files/25032337/llama_logs.txt)

Hi @rick-github, thanks for your reply. I have attached a debug log for a 5-minute window in which there are lots of occurrences of `llama runner started in`, despite setting `- name: OLLAMA_KEEP_ALIVE  value: "6h"  # Keep models in VRAM for 6 hours` in the deployment.


@ajmal-yazdani commented on GitHub (Feb 3, 2026):

Hi @rick-github,

I found out that our team is using a different `context_window` for each question category, which means we're changing API parameters, and that causes the model to unload and reload. Right?

`payload = { "model": "llama3.1:8b", "messages": [{"role": "user", "content": question}], "stream": False, "options": { "num_ctx": context_window, }, }`

Question: is there a way to avoid this?
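
For illustration, a hedged sketch of what that pattern looks like from the client side (URL and values are placeholders): whenever `num_ctx` differs from the currently loaded runner's context size, the model is reloaded and the call pays the ~30 s load cost.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed server address

def ask(question: str, context_window: int):
    # Changing num_ctx between calls changes the runner configuration,
    # which evicts the loaded model and reloads it with the new context size.
    payload = {
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": question}],
        "stream": False,
        "options": {"num_ctx": context_window},
    }
    return requests.post(OLLAMA_URL, json=payload, timeout=120).json()

ask("Short question", 4096)      # loads the runner with num_ctx=4096
ask("Another question", 32768)   # different num_ctx -> reload, ~30 s spike
```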


@rick-github commented on GitHub (Feb 3, 2026):

> we're changing API parameters, and that causes the model to unload and reload. Right?

Right.

> Question: is there a way to avoid this?

Here are some alternatives (a sketch of options 1 and 2 follows the list).

  1. Change the code to not set `num_ctx` in the API call.
  2. Change the code to use the `/v1/chat/completions` endpoint instead of `/api/chat`. They are compatible, but `/v1` will ignore the context window setting. The response is in a different format, so client changes are required.
  3. If changing the endpoint is not feasible, put a proxy in front of the ollama server and redirect `/api/chat` to `/v1/chat/completions`. The response is in a different format, so client changes are required.
  4. Proxy as above, but also rewrite the response into ollama API format.
  5. Use a proxy to rewrite the request to remove `num_ctx`. No client changes required.
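
A rough, non-authoritative sketch of options 1 and 2, assuming the server runs at `http://localhost:11434` and the `requests` package is available:

```python
import requests

BASE = "http://localhost:11434"  # assumed server address

# Option 1: keep /api/chat but stop sending num_ctx, so every request uses
# the same runner configuration and the loaded model is not evicted.
native_payload = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
r1 = requests.post(f"{BASE}/api/chat", json=native_payload, timeout=120).json()
print(r1["message"]["content"])                # native Ollama response shape

# Option 2: use the OpenAI-compatible endpoint; a context-window option would
# be ignored here, and the response shape is OpenAI-style (choices[...]).
openai_payload = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
r2 = requests.post(f"{BASE}/v1/chat/completions", json=openai_payload, timeout=120).json()
print(r2["choices"][0]["message"]["content"])  # OpenAI-style response shape
```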

@ajmal-yazdani commented on GitHub (Feb 3, 2026):

@rick-github Thanks for your reply.

What's the difference with `/v1`? Should we change `num_ctx` or tune this parameter? Is there any relation between hallucinated responses and `num_ctx`?

With `/v1`, why is this omitted? Can temperature be set with `/v1`? `"options": {"temperature": self.temperature, # "seed":40, "num_ctx": self.context_window }`
Does setting temperature cause the model to unload and reload?


@rick-github commented on GitHub (Feb 3, 2026):

> What's the difference with `/v1`?

`/v1` is the [OpenAI compatibility](https://github.com/ollama/ollama/blob/main/docs/api/openai-compatibility.mdx) endpoint. It allows clients that expect to use the OpenAI API to work with ollama. The OpenAI API does not support setting `num_ctx`, so ollama ignores attempts to set it via `/v1`.

> Is there any relation between hallucinated responses and `num_ctx`?

No.

> Can temperature be set with `/v1`?

Yes. `{"model":"xx",...,"temperature":self.temperature}`

> Does setting temperature cause the model to unload and reload?

No.
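
For reference, a small illustrative sketch of that request (URL assumed): with `/v1/chat/completions`, `temperature` is a top-level field in the JSON body rather than an entry under `"options"` as in `/api/chat`.

```python
import requests

payload = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.2,   # top-level field on /v1, not inside "options"
    "stream": False,
}
resp = requests.post("http://localhost:11434/v1/chat/completions",
                     json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```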


@ajmal-yazdani commented on GitHub (Feb 3, 2026):

Thanks @rick-github for your reply.
