[PR #8029] [CLOSED] Prevent model thrashing from unset num_ctx #11331

Closed
opened 2025-11-12 16:11:44 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/8029
Author: @rick-github
Created: 12/10/2024
Status: Closed

Base: main ← Head: num_ctx


📝 Commits (6)

e187f6f Prevent underflow when FreeMemory < overhead
ca3c818 Merge branch 'main' of https://github.com/rick-github/ollama
8426d4c Avoid model thrashing from unset num_ctx
115eebd Merge branch 'ollama:main' into num_ctx
394dcd2 Merge branch 'ollama:main' into num_ctx
01bc6c9 Add NumCtx to mock LLM

📊 Changes

3 files changed (+15 additions, -0 deletions)

View changed files

📝 llm/server.go (+5 -0)
📝 server/routes.go (+8 -0)
📝 server/sched_test.go (+2 -0)

📄 Description

TL;DR: a model shouldn't be evicted due to a num_ctx change if the client doesn't care about context size.

Client A loads a model with a context window different to the default or the value configured in the Modelfile:

$ curl localhost:11434/api/generate -d '{"model":"llama3.2","options":{"num_ctx":65536}}'
$ ollama ps
NAME               ID              SIZE     PROCESSOR    UNTIL   
llama3.2:latest    a80c4f17acd5    13 GB    100% GPU     Forever 

Client B runs a completion without specifying a context window, so the default value of 2048 is used, evicting and immediately reloading the model.

$ curl localhost:11434/api/generate -d '{"model":"llama3.2"}'
$ ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL   
llama3.2:latest    a80c4f17acd5    3.1 GB    100% GPU     Forever    

Client A sends another completion with the large context window, causing another eviction and reload.

$ curl localhost:11434/api/generate -d '{"model":"llama3.2","options":{"num_ctx":65536}}'
$ ollama ps
NAME               ID              SIZE     PROCESSOR    UNTIL   
llama3.2:latest    a80c4f17acd5    13 GB    100% GPU     Forever    

If client B is not concerned about the context window, it shouldn't cause the eviction of an already loaded model. This is particularly noticeable when sharing a model between the ollama and OpenAI endpoints: since the OpenAI endpoint can't set a context window, a model loaded via the ollama endpoint with a custom context window gets evicted by the next OpenAI request.

Thrashing can also occur when a client makes secondary completions after a primary completion, e.g. open-webui's auto-complete feature (see https://github.com/ollama/ollama/issues/7919#issuecomment-2560465774), or when a model is used for both completion and embedding (https://github.com/ollama/ollama/issues/6148#issuecomment-2568402497).
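
The intent can be illustrated with a minimal, self-contained Go sketch. This is not the actual diff to llm/server.go or server/routes.go; loadedRunner, pendingRequest, and needsReload are hypothetical names used only to show the idea: track whether the client explicitly set num_ctx, and skip the context-size comparison in the reload decision when it did not.

```go
// Hypothetical sketch of the reload decision, not Ollama's real code.
package main

import "fmt"

// loadedRunner models a runner already resident with some context size.
type loadedRunner struct {
	model  string
	numCtx int
}

// pendingRequest models an incoming completion request. numCtxSet records
// whether the client actually supplied "num_ctx" in its options.
type pendingRequest struct {
	model     string
	numCtx    int  // resolved value (default applied if unset)
	numCtxSet bool // true only if the client explicitly sent num_ctx
}

// needsReload reports whether the loaded runner must be evicted and
// reloaded to serve the request.
func needsReload(r loadedRunner, req pendingRequest) bool {
	if r.model != req.model {
		return true
	}
	// Only treat a context-size mismatch as a reason to reload when the
	// client cared enough to set num_ctx explicitly.
	if req.numCtxSet && req.numCtx != r.numCtx {
		return true
	}
	return false
}

func main() {
	runner := loadedRunner{model: "llama3.2", numCtx: 65536}

	// Client B: no num_ctx in the request, default of 2048 applied.
	b := pendingRequest{model: "llama3.2", numCtx: 2048, numCtxSet: false}
	fmt.Println("client B forces reload:", needsReload(runner, b)) // false

	// Client A: explicitly asks for a different context window.
	a := pendingRequest{model: "llama3.2", numCtx: 131072, numCtxSet: true}
	fmt.Println("client A forces reload:", needsReload(runner, a)) // true
}
```

Under this scheme, client B's request in the examples above would reuse the 65536-context runner loaded by client A instead of triggering an eviction, while a client that explicitly asks for a different num_ctx still forces a reload.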


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-12 16:11:44 -06:00

Reference: github-starred/ollama-ollama#11331