[GH-ISSUE #1268] feat: smart context length management #51086
Originally created by @tjbck on GitHub (Mar 22, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/1268
e.g. `messages.length > 10`, slice
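In Python terms, that hint amounts to something like the sketch below (the threshold of 10 comes from the example above; the helper name is illustrative):

```python
MAX_MESSAGES = 10  # threshold from the `messages.length > 10` example

def clip_messages(messages: list[dict]) -> list[dict]:
    # Once the conversation grows past the limit, drop the oldest
    # messages and keep only the most recent slice.
    if len(messages) > MAX_MESSAGES:
        return messages[-MAX_MESSAGES:]
    return messages
```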
@ghost commented on GitHub (Mar 26, 2024):
Great idea. I think it would be beneficial to cache this litellm file anyway, which contains useful information, including max_tokens:
https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json
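Caching that file and looking up max_tokens takes only a few lines (a minimal sketch; the cache path and helper name are assumptions, and the field names follow the litellm file):

```python
import json
import urllib.request

LITELLM_URL = (
    "https://raw.githubusercontent.com/BerriAI/litellm/main/"
    "model_prices_and_context_window.json"
)

def load_model_info(cache_path: str = "model_context_cache.json") -> dict:
    # Serve from the local cache when present; otherwise fetch once and store.
    try:
        with open(cache_path) as f:
            return json.load(f)
    except FileNotFoundError:
        with urllib.request.urlopen(LITELLM_URL) as resp:
            data = json.load(resp)
        with open(cache_path, "w") as f:
            json.dump(data, f)
        return data

info = load_model_info()
print(info["gpt-4"]["max_tokens"])
```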
While it may not be useful for local Ollama models, the Ollama Modelfile syntax supports the num_ctx parameter, which can be queried via the API. A good strategy may be to leverage the litellm JSON data for external models like OpenAI, and to presume that every Ollama model uses the Ollama default context length of 2048 unless the Modelfile sets num_ctx for that model. It's still beneficial to retain configurability, though, for cases where you don't need the maximum context or where the information is absent.
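A sketch of that fallback against Ollama's /api/show endpoint (parsing the parameters block as plain "key value" lines is an assumption about its format):

```python
import json
import urllib.request

def ollama_num_ctx(model: str, host: str = "http://localhost:11434") -> int:
    # /api/show returns Modelfile details, including a "parameters" block.
    req = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"name": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        info = json.load(resp)
    # The parameters block is plain text with one "key value" pair per line.
    for line in info.get("parameters", "").splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "num_ctx":
            return int(parts[1])
    return 2048  # the Ollama default assumed above
```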
@VertexMachine commented on GitHub (Jun 5, 2024):
Let me add some more information. This is very much needed if you use APIs, and not only the OpenAI API but also others like OpenRouter or Infermatics. Some model+endpoint combinations simply fail when you exceed the context length (returning error 400), while others incur massive cost for the user (cost grows with context size, so truncation might be a good option in those cases). Unfortunately, the problem is that there is no standardized tokenization endpoint defined in the OpenAI-compatible API; OpenAI recommends using https://github.com/openai/tiktoken on the client side.
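For OpenAI models, that client-side counting takes only a few lines with tiktoken:

```python
import tiktoken

# Look up the encoding that matches the model, then count tokens
# client-side before sending the request.
enc = tiktoken.encoding_for_model("gpt-4")
prompt = "How many tokens is this?"
print(len(enc.encode(prompt)))
```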
As a workaround, I use AutoTokenizer from Transformers (`from transformers import AutoTokenizer`) to calculate token counts in my apps. This is the function I've written (feel free to incorporate it in your code). I made it generic, as sometimes I don't want the BOS/EOS tokens counted (HF tokenizers add BOS by default, but not EOS).
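A minimal sketch of such a generic token counter (a reconstruction, not the commenter's original snippet; the model-name-to-tokenizer mapping is hypothetical):

```python
from transformers import AutoTokenizer

# Hypothetical mapping from API model names to Hugging Face tokenizer
# repos ("the above mapping" referenced below); extend as needed.
MODEL_TO_TOKENIZER = {
    "meta-llama/Meta-Llama-3-8B-Instruct": "meta-llama/Meta-Llama-3-8B-Instruct",
}

def count_tokens(text: str, model: str, add_special_tokens: bool = False) -> int:
    # Fall back to the gpt2 tokenizer when the model is unknown.
    repo = MODEL_TO_TOKENIZER.get(model, "gpt2")
    tokenizer = AutoTokenizer.from_pretrained(repo, legacy=False)
    # add_special_tokens=False leaves BOS/EOS out of the count.
    return len(tokenizer.encode(text, add_special_tokens=add_special_tokens))
```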
There are a few issues here: `tokenizer = AutoTokenizer.from_pretrained("model_name", legacy=False)` will download the appropriate tokenizer files. Also, feel free to use the above mapping as a starting point. As a last-resort fallback I simply use the gpt2 tokenizer: `AutoTokenizer.from_pretrained("gpt2")`.
@tjbck commented on GitHub (Jun 19, 2024):
The Filter function from #3247 will resolve this. You can essentially write your own custom middleware and install it with Functions.
@tjbck commented on GitHub (Jun 30, 2024):
https://openwebui.com/f/hub/context_clip_filter
Feedback wanted here!
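For reference, a hedged sketch of the shape such a filter takes (the Filter class with Valves and an inlet hook follows Open WebUI's Functions convention; the max_turns valve is an assumed parameter, and the linked context_clip_filter remains the canonical version):

```python
from pydantic import BaseModel

class Filter:
    class Valves(BaseModel):
        max_turns: int = 8  # assumed configurable limit

    def __init__(self):
        self.valves = self.Valves()

    def inlet(self, body: dict, __user__: dict | None = None) -> dict:
        # Keep the system prompt, then only the most recent turns.
        messages = body.get("messages", [])
        system = [m for m in messages if m.get("role") == "system"]
        rest = [m for m in messages if m.get("role") != "system"]
        body["messages"] = system + rest[-self.valves.max_turns:]
        return body
```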