[GH-ISSUE #11871] Extend Ollama’s Maximum Context Window to 512k Tokens (524,288) with Dynamic KV Allocation and Advanced Long-Context Scaling #69939

Open
opened 2026-05-04 19:49:32 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @Kinkazma on GitHub (Aug 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11871

512k

Today, Ollama allows the context window size to be set via the num_ctx parameter, with a default value of 2,048 tokens and support for up to 128k in certain recent models (e.g., Llama 3.1). However, a number of advanced use cases — such as large-scale corpus analysis, extensive code generation, or managing long multi-turn conversations — would directly benefit from a substantial increase to this limit.

The goal of this request is to increase the maximum supported context window to 524,288 tokens (512k), when the model and hardware resources allow it, while ensuring stable performance and optimized memory usage.
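For context, the window is already requested per call through the options.num_ctx field of the REST API. Below is a minimal Go sketch of what a 512k request would look like once the cap is lifted; the model name and value are illustrative only, and today the effective context is still limited by the model and runtime.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Illustrative request against a local Ollama server. The model name and
	// num_ctx value are placeholders; 524288 assumes the proposed new ceiling.
	body, _ := json.Marshal(map[string]any{
		"model":  "llama3.1",
		"prompt": "Summarize the attached corpus.",
		"stream": false,
		"options": map[string]any{
			"num_ctx": 524288, // proposed maximum (512k tokens)
		},
	})
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```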

Motivations

•	Full exploitation of long-context models: Several models on the market (including some already available in Ollama’s library) advertise combined input+output capacities of up to 512k tokens. Being unable to leverage this in Ollama limits their potential.
•	Flexibility for research and prototyping: R&D workflows often require experimenting with very large contexts without relying on RAG chunking, which can degrade response quality or reference coherence.
•	Alignment with competing runtimes: Projects like vLLM or Hugging Face Transformers backends have already introduced paged memory mechanisms and RoPE/YaRN scaling capable of reaching or exceeding 512k.

Technical Proposals

1.	Lift software limitations:
•	Allow num_ctx up to 524,288 if the model supports it (read directly from GGUF metadata).
•	Display in ollama show the difference between the model’s native context and the runtime’s effective context.
2.	Dynamic KV allocation:
•	Replace full pre-allocation of the K/V cache with a paged allocation approach, progressively allocating memory based on the number of tokens actually used (similar to PagedAttention in vLLM); a minimal sketch follows this list.
•	Optimize for Metal/Apple Silicon with paged MTL buffers and UMA support.
3.	Advanced long-context scaling management:
•	Officially expose RoPE/NTK/YaRN parameters in the Modelfile and API (options); a hypothetical options shape is sketched after this list.
•	Default to the model-provided values, but allow controlled overrides.
4.	Ergonomic improvements:
•	Add a --dry-run option or ollama plan command to estimate memory/performance impact of a given context before execution (a rough KV-cache size estimate is sketched after this list).
•	Enable defining and persisting a high num_ctx value via the CLI (/set, /save).
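To make point 2 concrete, here is a minimal, hypothetical sketch of page-granular K/V growth: memory is reserved one fixed-size page at a time as tokens arrive, instead of pre-allocating the full window up front. All names and sizes below are assumptions for illustration, not Ollama's actual cache implementation.

```go
package kvcache

// pageTokens is the number of token slots per K/V page (hypothetical size,
// chosen to mimic PagedAttention-style block granularity).
const pageTokens = 256

// Page holds the keys and values for pageTokens positions of one layer.
type Page struct {
	K []float32 // pageTokens * numKVHeads * headDim entries
	V []float32
}

// PagedCache grows one page at a time instead of reserving num_ctx slots up front.
type PagedCache struct {
	numKVHeads, headDim int
	pages               []*Page
	used                int // tokens currently stored
}

func NewPagedCache(numKVHeads, headDim int) *PagedCache {
	return &PagedCache{numKVHeads: numKVHeads, headDim: headDim}
}

// Append reserves space for n new tokens, allocating pages only on demand,
// and returns the first position index of the reserved range.
func (c *PagedCache) Append(n int) int {
	start := c.used
	c.used += n
	for len(c.pages)*pageTokens < c.used {
		entries := pageTokens * c.numKVHeads * c.headDim
		c.pages = append(c.pages, &Page{
			K: make([]float32, entries),
			V: make([]float32, entries),
		})
	}
	return start
}
```

With this layout, advertising a 524,288-token ceiling costs almost nothing for short prompts, since pages are only allocated as the context actually fills up.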
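For point 3, the exposure could simply be additional request options next to num_ctx. A hypothetical Go shape follows; the rope_* names loosely mirror existing llama.cpp scaling flags, and the yarn_* fields are invented here purely to illustrate the proposal and do not exist in Ollama's API today.

```go
package api

// LongContextOptions sketches what exposing long-context scaling knobs might
// look like as request/Modelfile options. All yarn_* field names are
// hypothetical; defaults would come from the model's GGUF metadata.
type LongContextOptions struct {
	NumCtx            int     `json:"num_ctx"`
	RopeFrequencyBase float64 `json:"rope_frequency_base,omitempty"`
	RopeScalingType   string  `json:"rope_scaling_type,omitempty"` // e.g. "linear" or "yarn"
	YarnFactor        float64 `json:"yarn_factor,omitempty"`       // context extension factor
	YarnOrigCtx       int     `json:"yarn_orig_ctx,omitempty"`     // model's original training context
}
```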
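Finally, for the --dry-run / ollama plan idea in point 4, the core of the estimate is the standard dense-attention K/V formula: 2 × layers × KV heads × head dim × context length × bytes per element. A small sketch with illustrative Llama-3.1-8B-class dimensions:

```go
package main

import "fmt"

// kvCacheBytes estimates K/V cache memory for a dense-attention model:
// 2 (K and V) * layers * kvHeads * headDim * ctx * bytesPerElem.
func kvCacheBytes(layers, kvHeads, headDim, ctx, bytesPerElem int) uint64 {
	return 2 * uint64(layers) * uint64(kvHeads) * uint64(headDim) *
		uint64(ctx) * uint64(bytesPerElem)
}

func main() {
	// Illustrative dimensions for a Llama-3.1-8B-class model (32 layers,
	// 8 KV heads, head dim 128) with an f16 cache (2 bytes per element).
	for _, ctx := range []int{8192, 131072, 524288} {
		gib := float64(kvCacheBytes(32, 8, 128, ctx, 2)) / (1 << 30)
		fmt.Printf("num_ctx=%7d -> ~%.1f GiB KV cache\n", ctx, gib)
	}
}
```

At f16 precision this works out to roughly 64 GiB of K/V cache at 524,288 tokens for an 8B-class model, which is exactly why the on-demand paging in point 2 matters.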
GiteaMirror added the feature request label 2026-05-04 19:49:32 -05:00
Author
Owner

@onestardao commented on GitHub (Aug 13, 2025):

It sounds like your proposal could hit Problem Map issue #9: Entropy Collapse — when the system’s output coherence degrades as context grows, especially at extreme window sizes.
If you want, I can share the detailed fix guide we’ve used to help over 80 devs mitigate this in high-token settings — just let me know and I’ll send you the link.

Reference: github-starred/ollama#69939