[GH-ISSUE #2732] Provide model context length in API #1642

Closed

Originally created by @gluonfield on GitHub (Feb 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2732

Hi guys, I love Ollama and am contributing to the ecosystem by building [Enchanted](https://github.com/AugustDev/enchanted).

One important thing that is currently missing from the `/api/show` API is the context length that a model supports. For RAG applications this is essential to know (see the budgeting sketch after the example below), and it seems that future models will support widely varying context lengths.

It would be great to include this metadata in the API as well as in the [Ollama library](https://ollama.com/library).

For example, the `/api/show` response for Gemma contains no clue about the context length:

```
{
  "license": "...",
  "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM gemma:7b\n\nFROM /Users/wpc/.ollama/models/blobs/sha256:456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9\nTEMPLATE \"\"\"<start_of_turn>user\n{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>\n<start_of_turn>model\n{{ .Response }}<end_of_turn>\n\"\"\"\nPARAMETER repeat_penalty 1\nPARAMETER stop \"<start_of_turn>\"\nPARAMETER stop \"<end_of_turn>\"",
  "parameters": "repeat_penalty                 1\nstop                           \"<start_of_turn>\"\nstop                           \"<end_of_turn>\"",
  "template": "<start_of_turn>user\n{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>\n<start_of_turn>model\n{{ .Response }}<end_of_turn>\n",
  "details": {
    "parent_model": "",
    "format": "gguf",
    "family": "gemma",
    "families": [
      "gemma"
    ],
    "parameter_size": "9B",
    "quantization_level": "Q4_0"
  }
}
```
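
To make the RAG point concrete, here is a small hypothetical budgeting sketch (illustrative numbers, not from the original issue): the context window caps how many retrieved chunks fit alongside the prompt, so without the real value a client can only guess.

```
# Hypothetical RAG token budgeting; all numbers are illustrative.
# Without the model's real context length, context_length below has
# to be guessed, which is exactly the gap this issue points out.
def budget_chunks(chunks, tokens_per_chunk, prompt_tokens,
                  context_length, reply_reserve=512):
    # Tokens left for retrieved context once the prompt and a
    # reserve for the model's reply are subtracted.
    budget = context_length - prompt_tokens - reply_reserve
    return chunks[: max(budget // tokens_per_chunk, 0)]

# 8192-token window, 1000-token prompt, 512-token chunks:
# (8192 - 1000 - 512) // 512 == 13 chunks fit.
docs = [f"chunk-{i}" for i in range(20)]
print(len(budget_chunks(docs, 512, 1_000, 8_192)))  # -> 13
```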

@Nurgo commented on GitHub (Mar 20, 2024):

I can confirm that this feature would be very useful. We'd love to add Ollama support to [BrainSoup](https://www.nurgo-software.com/products/brainsoup), but not being able to determine a model's maximum context size is very problematic for optimizing RAG.


@supercurio commented on GitHub (Jun 4, 2024):

I need this feature as well, in order to set the context size and build the chat history according to the model's capability.

Current workarounds have shortcomings:

  • pattern matching on the model name is not ideal when llama3 has an 8k context and llama3-gradient has 1048k, though it is feasible when matching on the "llama3:" or "llama3-gradient:" prefixes (see the sketch after this list).
  • pattern matching on the base name doesn't work for phi3, which exists in both 4k and 128k variants.
  • the info is [available on ollama.com](https://ollama.com/library/llama3-gradient/blobs/011c3962dbd7), but that doesn't work offline.
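
A minimal sketch of that name-prefix workaround (an editorial illustration, not from the thread; the hard-coded sizes are assumptions that would need constant manual upkeep):

```
# Sketch of the name-prefix workaround; the context sizes here are
# hard-coded assumptions. Longer prefixes must be checked first so
# "llama3-gradient:" does not fall through to the "llama3:" entry.
KNOWN_CONTEXT_SIZES = {
    "llama3-gradient:": 1_048_576,
    "llama3:": 8_192,
    # No safe entry is possible for phi3: it ships in 4k and 128k
    # variants under the same base name.
}

def guess_context_length(model: str, default: int = 2_048) -> int:
    for prefix, size in KNOWN_CONTEXT_SIZES.items():
        if model.startswith(prefix):
            return size
    return default  # unknown model: fall back to a conservative guess

print(guess_context_length("llama3:8b-instruct-q4_0"))  # -> 8192
```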

The maximum context size info itself would be valuable. It might be even more useful in conjunction with an API that also returns:

  • VRAM availability
  • Which model + context size fits in VRAM.

@jmorganca commented on GitHub (Jun 25, 2024):

Hi all, this should be possible via `/api/show`:

"model_info": {
    "general.architecture": "llama",
    "general.file_type": 2,
    "general.parameter_count": 8030261248,
    "general.quantization_version": 2,
    "llama.attention.head_count": 32,
    "llama.attention.head_count_kv": 8,
    "llama.attention.layer_norm_rms_epsilon": 0.00001,
    "llama.block_count": 32,
    "llama.context_length": 8192,
    "llama.embedding_length": 4096,
    "llama.feed_forward_length": 14336,
    "llama.rope.dimension_count": 128,
    "llama.rope.freq_base": 500000,
    "llama.vocab_size": 128256,
    "tokenizer.ggml.bos_token_id": 128000,
    "tokenizer.ggml.eos_token_id": 128009,
    "tokenizer.ggml.merges": [],
    "tokenizer.ggml.model": "gpt2",
    "tokenizer.ggml.pre": "llama-bpe",
    "tokenizer.ggml.token_type": [],
    "tokenizer.ggml.tokens": []
  }
```
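
Building on that reply, a minimal client sketch (assuming a local Ollama server on the default port; older Ollama versions used a "name" request field instead of "model", so check the API docs for your version):

```
import json
import urllib.request

def get_context_length(model: str, host: str = "http://localhost:11434") -> int:
    # POST /api/show returns a "model_info" map whose keys are
    # prefixed with the architecture from "general.architecture",
    # e.g. "llama.context_length" or "gemma.context_length".
    req = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        info = json.load(resp)["model_info"]
    arch = info["general.architecture"]
    return info[f"{arch}.context_length"]

print(get_context_length("llama3"))  # e.g. 8192 for the model above
```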