[GH-ISSUE #2431] Ability to preload a model? #47932

Closed
opened 2026-04-28 05:57:18 -05:00 by GiteaMirror · 8 comments

Originally created by @powellnorma on GitHub (Feb 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2431

Is it possible to preload a model without actually using it? For example, if the user starts typing their request, it would be useful to be able to "preload" the model instead of only loading it once the request is submitted.

GiteaMirror added the documentation label 2026-04-28 05:57:18 -05:00

@easp commented on GitHub (Feb 10, 2024):

I believe that the CLI accomplishes this by making a generate call with an empty prompt.


@BruceMacD commented on GitHub (Feb 10, 2024):

@easp is correct, an empty request to the /chat, /generate, or /embeddings endpoint will preload a model.

Here's what that looks like with cURL:

```
curl http://localhost:11434/api/generate -d '{
    "model": "mistral"
}'

curl http://localhost:11434/api/chat -d '{
    "model": "mistral"
}'

curl http://localhost:11434/api/embeddings -d '{
    "model": "mistral"
}'
```

You can do it with empty messages/prompts in the SDKs too.
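
For example, with the Python client it might look something like this (a rough sketch; it assumes the client simply forwards an empty messages list or empty prompt to the API):

```
import ollama  # official Python client: pip install ollama

# An empty chat request should load the model without generating anything
# (assumption: the client forwards the empty messages list as-is).
ollama.chat(model="mistral", messages=[])

# The same idea with generate: an empty prompt only preloads the model.
ollama.generate(model="mistral", prompt="")
```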

Leaving this open for now as this should be documented somewhere.


@bouchtaoui-dev commented on GitHub (Feb 10, 2024):

I experience a slow response when requesting via /api/chat, as if it starts the model first and then processes the request.
Subsequent requests are still slow, while on the CLI it works much faster, taking only a few seconds.

I don't have a powerful system with a GPU. I have an 11th gen i5 mini PC with 16 GB. Running Llama2 or Mistral runs fine.


@powellnorma commented on GitHub (Feb 10, 2024):

Does anyone know why the initial API call to /chat (with an empty list of messages) still causes a CPU usage spike (up to 10 s) when starting the same model via ollama run .., even when the model is already loaded (judging from the memory usage of ollama serve)?

Perhaps the default pre-prompt is evaluated? Is it possible to cache that computation somehow?

The log looks like this:

```
Feb 10 16:48:34 ollama[3332]: [GIN] 2024/02/10 - 16:48:34 | 200 |      18.921µs |       127.0.0.1 | HEAD     "/"
Feb 10 16:48:34 ollama[3332]: [GIN] 2024/02/10 - 16:48:34 | 200 |     323.252µs |       127.0.0.1 | POST     "/api/show"
Feb 10 16:48:34 ollama[3332]: [GIN] 2024/02/10 - 16:48:34 | 200 |     328.492µs |       127.0.0.1 | POST     "/api/show"
Feb 10 16:48:43 ollama[3332]: [GIN] 2024/02/10 - 16:48:43 | 200 |  8.891859947s |       127.0.0.1 | POST     "/api/chat"
Feb 10 16:52:24 ollama[3332]: [GIN] 2024/02/10 - 16:52:24 | 200 |      18.011µs |       127.0.0.1 | HEAD     "/"
Feb 10 16:52:24 ollama[3332]: [GIN] 2024/02/10 - 16:52:24 | 200 |     340.832µs |       127.0.0.1 | POST     "/api/show"
Feb 10 16:52:24 ollama[3332]: [GIN] 2024/02/10 - 16:52:24 | 200 |     265.222µs |       127.0.0.1 | POST     "/api/show"
Feb 10 16:52:29 ollama[3332]: [GIN] 2024/02/10 - 16:52:29 | 200 |  4.443498171s |       127.0.0.1 | POST     "/api/chat"
Feb 10 16:52:31 ollama[3332]: [GIN] 2024/02/10 - 16:52:31 | 200 |       16.83µs |       127.0.0.1 | HEAD     "/"
Feb 10 16:52:31 ollama[3332]: [GIN] 2024/02/10 - 16:52:31 | 200 |     387.432µs |       127.0.0.1 | POST     "/api/show"
Feb 10 16:52:31 ollama[3332]: [GIN] 2024/02/10 - 16:52:31 | 200 |     295.771µs |       127.0.0.1 | POST     "/api/show"
Feb 10 16:52:35 ollama[3332]: [GIN] 2024/02/10 - 16:52:35 | 200 |  3.855491544s |       127.0.0.1 | POST     "/api/chat"
Feb 10 16:54:01 ollama[3332]: [GIN] 2024/02/10 - 16:54:01 | 200 |       29.36µs |       127.0.0.1 | HEAD     "/"
Feb 10 16:54:01 ollama[3332]: [GIN] 2024/02/10 - 16:54:01 | 200 |     342.092µs |       127.0.0.1 | POST     "/api/show"
Feb 10 16:54:01 ollama[3332]: [GIN] 2024/02/10 - 16:54:01 | 200 |     277.661µs |       127.0.0.1 | POST     "/api/show"
Feb 10 16:54:12 ollama[3332]: [GIN] 2024/02/10 - 16:54:12 | 200 | 10.302476638s |       127.0.0.1 | POST     "/api/chat"
```

@mpetruc commented on GitHub (Feb 12, 2024):

Should the model stay loaded? In my case it seems that it is being unloaded after a few minutes of inactivity. While this might not be a problem with fast-loading models, it is extremely painful with larger ones like mixtral-8x7b-instruct-v0.1.Q8_0.gguf. I am on an i7 w/64 GB RAM and an RTX 3080 w/16 GB, using the SDK. Thanks.


@pdevine commented on GitHub (Feb 19, 2024):

I've updated the FAQ to cover both situations ([pre-loading models](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-pre-load-a-model-to-get-faster-response-times) as well as [controlling how long models are loaded into memory](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately)). I think people were missing this in the API docs.

The TL;DR is:

  • to preload a model, send an empty request with the model you want
  • to unload a model, use the keep_alive parameter and set it to 0
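
A rough Python sketch of both, for illustration (the model name and the use of the requests library are assumptions here, not part of the FAQ):

```
import requests

BASE = "http://localhost:11434"

# Preload: an empty request naming only the model.
requests.post(f"{BASE}/api/generate", json={"model": "mistral"})

# Keep the model loaded indefinitely (keep_alive also accepts durations such as "10m").
requests.post(f"{BASE}/api/generate", json={"model": "mistral", "keep_alive": -1})

# Unload immediately.
requests.post(f"{BASE}/api/generate", json={"model": "mistral", "keep_alive": 0})
```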

@haydonryan commented on GitHub (May 15, 2024):

A slight variation on this - it would be really good if there were a setting in the config file (load a default model on startup) - having to write a script to hit the model once on startup is kinda annoying. Also, for a large model (e.g. Mixtral 8x22b), off-the-shelf clients will time out.
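
For what it's worth, the kind of workaround script being described might look roughly like this (a Python sketch; the model tag, retry loop, and timeouts are illustrative assumptions):

```
import time
import requests

BASE = "http://localhost:11434"
MODEL = "mixtral:8x22b"  # hypothetical tag of the model to preload

# Wait for the server to come up, then preload with a generous timeout so a
# large model has time to load before real clients send requests.
for _ in range(60):
    try:
        requests.get(BASE, timeout=1)
        break
    except requests.ConnectionError:
        time.sleep(1)

requests.post(f"{BASE}/api/generate",
              json={"model": MODEL, "keep_alive": -1},
              timeout=600)
```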


@philippeflorent commented on GitHub (Mar 15, 2026):

Just sending the model does not work when running in a script.
It might work on the CLI, but not in Python as a preload process:
the request just gets returned, but the first prompt still takes ages to process.

I try to check the content of the answer to see if the model is loaded, but I always receive the same data:

{'id': 'cmpl-842', 'object': 'text_completion', 'created': 1773575087, 'model': 'qwen2.5:0.5b', 'system_fingerprint': 'fp_ollama', 'choices': [{'text': '', 'index': 0, 'finish_reason': 'load'}], 'usage': {'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}}
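
One way to check whether the model actually stays resident, rather than inspecting the preload response, may be to ask the server which models are loaded via the /api/ps endpoint (a sketch; the field names assume the documented response shape):

```
import requests

# /api/ps lists the models currently loaded into memory.
loaded = requests.get("http://localhost:11434/api/ps").json()
print([m["name"] for m in loaded.get("models", [])])
```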

Reference: github-starred/ollama#47932