[GH-ISSUE #8356] Allow context to be set from the command line. #67414

Closed
opened 2026-05-04 10:16:37 -05:00 by GiteaMirror · 11 comments

Originally created by @iplayfast on GitHub (Jan 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8356

I have a shell script called wizard that is simply

```
more ~/bin/wizard
#!/bin/bash
ollama run llama3.2
```

It's a very useful script, but I would like it to be able to set the context to a larger size, something like:

```
ollama run llama3.2 --num_ctx 4096
```

Current docs:

> How can I specify the context window size?

By default, Ollama uses a context window size of 2048 tokens.

To change this when using `ollama run`, use `/set parameter`:

```
/set parameter num_ctx 4096
```

When using the API, specify the `num_ctx` parameter:

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'
```
GiteaMirror added the feature request label 2026-05-04 10:16:38 -05:00

@rick-github commented on GitHub (Jan 9, 2025):

workaround: create a model that has a different context size.

```console
$ ollama run llama3.2
>>> /set parameter num_ctx 4096
Set parameter 'num_ctx' to '4096'
>>> /save wizard
Created new model 'wizard'
>>> /bye
$ ollama show wizard
  Model
    architecture        llama
    parameters          3.2B
    context length      131072
    embedding length    3072
    quantization        Q4_K_M

  Parameters
    num_ctx    4096
    stop       "<|start_header_id|>"
    stop       "<|end_header_id|>"
    stop       "<|eot_id|>"

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT
    Llama 3.2 Version Release Date: September 25, 2024
$ ollama run wizard
>>> hello
Hello! How can I assist you today?
>>> /bye
```
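
The same workaround can also be scripted so it fits the original wizard use case, using Ollama's Modelfile mechanism instead of the interactive `/save` (a minimal sketch; the `wizard` name and the 4096 value are just the examples from this thread):

```bash
# Build a "wizard" model with a larger default context, non-interactively.
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER num_ctx 4096
EOF
ollama create wizard -f Modelfile

# The wrapper script then simply runs the derived model.
ollama run wizard
```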

@pacien commented on GitHub (Jan 9, 2025):

Patch allowing that: https://github.com/ollama/ollama/pull/8340


@pdevine commented on GitHub (Jan 11, 2025):

Going to go ahead and close this since there's a workaround.


@iplayfast commented on GitHub (Jan 27, 2025):

@pdevine I don't think this issue should be closed. There is a workaround, but it is not a solution. Just because you can push a car to the gas station doesn't mean that it's got a full tank.


@rick-github commented on GitHub (Jan 27, 2025):

The ollama CLI is just a simple client. There are other ways to get more functionality: clients in [integrations](https://github.com/ollama/ollama?tab=readme-ov-file#community-integrations), the patch in #8340 plus a self-build, or a small script that adds the required arguments.

```python
#!/usr/bin/env python3

import ollama
import argparse
import sys
try:
  import readline  # optional: line editing and history for the interactive prompt
except ImportError:
  pass

parser = argparse.ArgumentParser()
parser.add_argument("--system", help="Set system message", default=None)
parser.add_argument("--num_ctx", help="Set context size", default=None)
parser.add_argument("--temperature", help="Set temperature", default=None)
parser.add_argument("--nostream", help="Disable streaming", default=False, action="store_true")
parser.add_argument("model")
parser.add_argument("prompts", nargs='*')
args = parser.parse_args()

client = ollama.Client()
userprompt = ">>> " if sys.stdin.isatty() else ""

# Only pass the options the user actually set.
options = {}
if args.temperature:
  options["temperature"] = float(args.temperature)
if args.num_ctx:
  options["num_ctx"] = int(args.num_ctx)

def chat(messages, prompt):
  messages.append({"role": "user", "content": prompt})
  response = client.chat(model=args.model, messages=messages, options=options, stream=not args.nostream)
  m = ''
  for r in response if not args.nostream else [response]:
    c = r['message']['content']
    print(c, end='', flush=True)
    m = m + c
  print()
  messages.append({"role": "assistant", "content": m})
  return messages

messages = []
if args.system:
  messages.append({"role": "system", "content": args.system})
for prompt in args.prompts:
  messages = chat(messages, prompt)
while True:
  try:
    prompt = input(userprompt)
  except (EOFError, KeyboardInterrupt):
    print()
    break
  if prompt == "/bye":
    break
  messages = chat(messages, prompt)
```
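
Saved as, say, ~/bin/wizard and made executable, a script like this could be invoked much like the original wrapper (illustrative only; the flags are the ones defined in the script above):

```bash
chmod +x ~/bin/wizard
wizard --num_ctx 4096 llama3.2 "Why is the sky blue?"
```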

If the fuel tank on the car is too small, get a car with a bigger tank.


@MrHyplex9511 commented on GitHub (Aug 15, 2025):

`ollama run llama3.2 --num_ctx 4096` doesn't work.


@pdevine commented on GitHub (Aug 15, 2025):

```
ollama run llama3.2
>>> /set parameter num_ctx 4096
>>>
```

Alternatively you can set the `OLLAMA_CONTEXT_LENGTH` variable when you're running `ollama serve` to change the default context size.
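
For example, on a machine where the server is started by hand, the default could be raised for every model like this (a minimal sketch; the value 4096 is just an example, and managed installs set environment variables through their own mechanisms):

```bash
# Raise the default context window for everything this server instance loads.
OLLAMA_CONTEXT_LENGTH=4096 ollama serve
```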


@Heavy-A commented on GitHub (Feb 22, 2026):

Why isn't the max context length used by default anyway? This is such a hidden default that it doesn't make much sense. It could default to the maximum, with an option to globally override it or a Modelfile to downgrade it. Instead I now have to go through all my models, make a Modelfile for each, and apply it. This is upside-down thinking.


@rick-github commented on GitHub (Feb 22, 2026):

Set `OLLAMA_CONTEXT_LENGTH=2000000`. The ollama server will cap the context size to the max supported by the model.
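
On a Linux install where the server runs as a systemd service, the variable would typically be set on the unit rather than in an interactive shell (a sketch, assuming the default ollama.service unit created by the install script):

```bash
# Open an override file for the service and add the variable under [Service]:
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_CONTEXT_LENGTH=2000000"

# Reload and restart so the new default takes effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```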


@Heavy-A commented on GitHub (Feb 22, 2026):

Great!
I have not set it yet, but I noticed there were some models that were loaded with a context higher than 4096. I now understand that ollama can and does set the context length dynamically, as long as your app requests it. Will setting the max via `OLLAMA_CONTEXT_LENGTH` then always use the max, irrespective of the request?

Alpaca does this (only in 'child' models), and Obsidian does it when using Copilot (setting it to 128k just by testing the connection). It did this, setting 128k, for two small 40k models, qwen3:0.6b & qwen3:1.7b. From the memory used it seems that the context length is indeed capped at the model max (40k). The `ollama ps` tool however does show 128k (which was requested, so that seems incorrect).

Normal:

```console
user@ThinkBook:~$ ollama ps
NAME          ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen3:0.6b    7df6b6e09427    1.0 GB    100% CPU     4096       4 minutes from now
```

Alpaca:

```console
NAME          ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen3:0.6b    7df6b6e09427    2.4 GB    100% CPU     16384      4 minutes from now
```

Obsidian:

```console
user@ThinkBook:~$ ollama ps
NAME          ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen3:0.6b    7df6b6e09427    5.3 GB    100% CPU     131072     3 minutes from now

user@ThinkBook:~$ ollama show qwen3:0.6b
  Model
    architecture        qwen3
    parameters          751.63M
    context length      40960
    embedding length    1024
    quantization        Q4_K_M
```

Now I understand.


@rick-github commented on GitHub (Feb 22, 2026):

> The ollama ps tool however does show 128k (which was requested, so that seems incorrect)

There was a recent change to ollama that corrected this, try updating.

Reference: github-starred/ollama#67414