[GH-ISSUE #10829] mistral-small3.1:latest: OLLAMA_CONTEXT_LENGTH=16384 but ollama runs with --ctx-size 4096 #32870

Closed
opened 2026-04-22 14:46:07 -05:00 by GiteaMirror · 5 comments

Originally created by @dguembel-itomig on GitHub (May 23, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10829

What is the issue?

Hello,

I am running ollama 0.7.0 in a dockerized environment and have set the OLLAMA_CONTEXT_LENGTH environment variable to 16384. It seems (see below) that this works (the variable is correctly set), but ollama run ignores it and starts with a ctx-size of 4096 when using mistral-small3.1:latest.
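
For reference, a minimal sketch of how the variable can be passed in a dockerized setup (the image, volume and container names here are illustrative, not my exact compose setup):

# hypothetical docker run equivalent; OLLAMA_CONTEXT_LENGTH is the relevant part
docker run -d --gpus=all \
  -e OLLAMA_CONTEXT_LENGTH=16384 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama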

It looks to me like this is because the Modelfile of Mistral-Small 3.1 hard-sets num_ctx to 4096, which is not the case for (e.g.) mistral-small:latest, gemma3:12b-it-qat, or phi4:latest (and probably many others).

# docker exec -it watchtower-ollama-1 ollama run mistral-small:latest 
>>> /show parameters
Model defined parameters:
temperature                    0.15
>>> 

# docker exec -it watchtower-ollama-1 ollama run mistral-small3.1:latest 
>>> /show parameters
Model defined parameters:
num_ctx                        4096
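
The baked-in parameter can also be seen by dumping the published Modelfile (a quick check; this assumes the parameter line looks like it does in other Modelfiles):

# dump the Modelfile shipped with the model and look for the hard-set context size
docker exec watchtower-ollama-1 ollama show --modelfile mistral-small3.1:latest | grep -i "num_ctx"
# prints something like: PARAMETER num_ctx 4096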

So, my questions:

  • is this a bug in ollama's behavior (i.e. shouldn't OLLAMA_CONTEXT_LENGTH override a Modelfile setting)?
  • if not, could the documentation state what takes precedence over what (Modelfile setting >> env variable)?
  • or is this unintended, and does the mistral-small3.1:latest Modelfile need an update?

If I can be of assistance in debugging / testing, please let me know.

Thank you,

David

(I manually added some newlines in the log so the env variables are easier to read)

more info

At startup, the env variable looks correct (see log entries below): OLLAMA_CONTEXT_LENGTH:16384

But according to the log, the command that is actually run is /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc --ctx-size 4096 --batch-size 512 --n-gpu-layers 26 --threads 6 --flash-attn --kv-cache-type q8_0 --parallel 1 --port 44161

Thus, I see this warning in the logs:
time=2025-05-23T07:22:41.725Z level=WARN source=runner.go:151 msg="truncating input prompt" limit=4096 prompt=4665 keep=4 new=4096

time=2025-05-23T07:15:56.077Z level=INFO source=routes.go:1205 msg="server config" env="map
[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: 
OLLAMA_CONTEXT_LENGTH:16384 
OLLAMA_DEBUG:INFO 
OLLAMA_FLASH_ATTENTION:true 
OLLAMA_GPU_OVERHEAD:0 
OLLAMA_HOST:http://0.0.0.0:11434 
OLLAMA_INTEL_GPU:false 
OLLAMA_KEEP_ALIVE:5m0s 
OLLAMA_KV_CACHE_TYPE:q8_0 
OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s 
OLLAMA_MAX_LOADED_MODELS:0 
OLLAMA_MAX_QUEUE:512 
OLLAMA_MODELS:/root/.ollama/models 
OLLAMA_MULTIUSER_CACHE:false 
OLLAMA_NEW_ENGINE:false 
OLLAMA_NOHISTORY:false 
OLLAMA_NOPRUNE:false 
OLLAMA_NUM_PARALLEL:0 
OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] 
OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-05-23T07:15:56.079Z level=INFO source=images.go:463 msg="total blobs: 57"
time=2025-05-23T07:15:56.079Z level=INFO source=images.go:470 msg="total unused blobs removed: 0"
time=2025-05-23T07:15:56.079Z level=INFO source=routes.go:1258 msg="Listening on [::]:11434 (version 0.7.0)"
time=2025-05-23T07:15:56.080Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-05-23T07:15:56.400Z level=INFO source=types.go:130 msg="inference compute" id=GPU-aa5c1cb5-bae0-5032-c426-49bcfd0cf7b5 library=cuda variant=v12 compute=8.9 driver=12.5 name="NVIDIA RTX 4000 SFF Ada Generation" total="19.6 GiB" available="19.4 GiB"
time=2025-05-23T07:15:57.287Z level=INFO source=server.go:135 msg="system memory" total="62.6 GiB" free="60.5 GiB" free_swap="32.0 GiB"
time=2025-05-23T07:15:57.406Z level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=26 layers.split="" memory.available="[19.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.1 GiB" memory.required.partial="19.1 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[19.1 GiB]" memory.weights.total="13.1 GiB" memory.weights.repeating="12.7 GiB" memory.weights.nonrepeating="360.0 MiB" memory.graph.full="213.3 MiB" memory.graph.partial="213.3 MiB" projector.weights="769.3 MiB" projector.graph="8.8 GiB"
time=2025-05-23T07:15:57.406Z level=INFO source=server.go:211 msg="enabling flash attention"


time=2025-05-23T07:15:57.428Z level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-1fa8532d986d729117d6b5ac2c884824d0717c9468094554fd1d36412c740cfc --ctx-size 4096 --batch-size 512 --n-gpu-layers 26 --threads 6 --flash-attn --kv-cache-type q8_0 --parallel 1 --port 44161"


time=2025-05-23T07:15:57.428Z level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-05-23T07:15:57.428Z level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-23T07:15:57.428Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-23T07:15:57.437Z level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-05-23T07:15:57.437Z level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:44161"
time=2025-05-23T07:15:57.460Z level=INFO source=ggml.go:73 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=43
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-05-23T07:15:57.503Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-05-23T07:15:57.588Z level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="6.2 GiB"
time=2025-05-23T07:15:57.588Z level=INFO source=ggml.go:299 msg="model weights" buffer=CUDA0 size="8.2 GiB"
time=2025-05-23T07:15:57.684Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-23T07:15:59.346Z level=INFO source=ggml.go:556 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="152.0 MiB"
time=2025-05-23T07:15:59.346Z level=INFO source=ggml.go:556 msg="compute graph" backend=CPU buffer_type=CPU size="148.0 MiB"
time=2025-05-23T07:15:59.441Z level=INFO source=server.go:630 msg="llama runner started in 2.01 seconds"

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 14:46:07 -05:00

@rick-github commented on GitHub (May 23, 2025):

The precedence is global (env variables) -> per-model attributes -> API call. It's unusual for a model to have a context size set; this may have been done to increase it from the old ollama default of 2048. At the moment the only remedy available is to modify the Modelfile of the model to remove the parameter, or to specify num_ctx in API calls.
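
A sketch of the Modelfile remedy (the new model name mistral-small3.1-16k is just an example, the container name is taken from the report above, and sed -i assumes a Linux host):

# export the published Modelfile, drop the hard-coded num_ctx line, and register a local variant
docker exec watchtower-ollama-1 ollama show --modelfile mistral-small3.1:latest > Modelfile
sed -i '/^PARAMETER num_ctx/d' Modelfile
docker cp Modelfile watchtower-ollama-1:/tmp/Modelfile
docker exec watchtower-ollama-1 ollama create mistral-small3.1-16k -f /tmp/Modelfile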


@dguembel-itomig commented on GitHub (May 23, 2025):

Hi,

thank you very much for explaining this - so there is

  • an unusual model file, but that’s for the maintainer to decide if it needs changing (I'd vote for "yes" ;-)
  • additionally, there is a bug in Ollama's behavior because the environment variable should take precedence over the model file, which is not currently the case.

Is that correct?


@rick-github commented on GitHub (May 23, 2025):

  • an unusual model file, but that’s for the maintainer to decide if it needs changing (I'd vote for "yes" ;-)

Yes.

  • additionally, there is a bug in Ollama's behavior because the environment variable should take precedence over the model file, which is not currently the case.

No. The application of settings goes from coarse-grained to fine-grained. Environment variables are the most coarse-grained: they affect every model. Model settings are finer-grained, affecting just that model. API calls are the most fine-grained, affecting just the inference request.
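
For illustration, the finest-grained override is per request, e.g. (a sketch; the prompt text and value are made up):

# options.num_ctx on the request wins over both the env variable and the Modelfile value
curl http://localhost:11434/api/generate -d '{
  "model": "mistral-small3.1:latest",
  "prompt": "Summarise this document ...",
  "options": { "num_ctx": 16384 }
}'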


@dguembel-itomig commented on GitHub (May 23, 2025):

Thanks. Then, I believe this is not an ollama bug and can be closed.
How can I contact the owner of the mistral-small3.1 model(file)?


@rick-github commented on GitHub (May 23, 2025):

There's no specific owner for the ollama library models. Open a new issue, mark it as a "model request", and indicate the required change. In my experience, getting a model adjusted after it's been published is a bit hit and miss. Create the request, but then modify your local copy, because it may be some time before the model is updated.

Reference: github-starred/ollama#32870