[GH-ISSUE #11964] context size larger than set #54457

Closed
opened 2026-04-29 06:00:41 -05:00 by GiteaMirror · 3 comments

Originally created by @StarPet on GitHub (Aug 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11964

What is the issue?

In my Python app I use 'num_ctx': 3182 to reduce the context size of the gpt-oss:20b model so that it fits on my two RTX 5060 Ti cards (16 GB each). However, the runner process receives a different value for its --ctx-size parameter.

localai  3573918 3102624 99 11:46 ?        00:03:17 /home/localai/bin/ollama runner --ollama-engine --model /home/localai/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 40960 --batch-size 512 --n-gpu-layers 17 --threads 20 --parallel 5 --tensor-split 9,8 --port 34017
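For reference, the request from the Python side boils down to something like this (a simplified sketch using the plain REST API; the model name, prompt and host are placeholders and my actual app code differs):

import requests

# Simplified reproduction: num_ctx is passed in the "options" field of the
# /api/generate request. Prompt, host and timeout are placeholder values.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:20b",
        "prompt": "Hello",
        "stream": False,
        "options": {"num_ctx": 3182},  # the context size I ask for
    },
    timeout=300,
)
print(resp.json()["response"])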

Note: before starting Ollama, I also set the following environment variable:

export OLLAMA_CONTEXT_LENGTH=2048

I checked the running process:

# strings /proc/3573918/environ |grep OLLAMA_CONTEXT
OLLAMA_CONTEXT_LENGTH=2048

As a result, the model ends up running partly on the CPU instead of entirely on the GPUs.

Either I don't really understand the relationship between OLLAMA_CONTEXT_LENGTH and the num_ctx parameter, or there is an issue caused by recent changes (it used to work just fine).

Relevant log output

time=2025-08-19T12:07:50.750+02:00 level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/localai/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:5 OLLAMA_ORIGINS:[http://10.1.0.65:8123 http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-19T12:07:50.768+02:00 level=INFO source=images.go:477 msg="total blobs: 446"
time=2025-08-19T12:07:50.771+02:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-19T12:07:50.772+02:00 level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-19T12:07:50.772+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-19T12:07:51.572+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-363e5ad1-be76-53a7-0086-e28ee69fe5b8 library=cuda variant=v12 compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5060 Ti" total="15.5 GiB" available="15.3 GiB"
time=2025-08-19T12:07:51.572+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ffafa4fa-6852-5a5a-10da-1a1de150a7e0 library=cuda variant=v12 compute=12.0 driver=12.9 name="NVIDIA GeForce RTX 5060 Ti" total="15.5 GiB" available="15.3 GiB"
[GIN] 2025/08/19 - 12:08:42 | 200 |   40.084812ms |       10.1.0.65 | POST     "/api/show"
[GIN] 2025/08/19 - 12:08:42 | 200 |   39.906831ms |       10.1.0.65 | POST     "/api/show"
[GIN] 2025/08/19 - 12:08:42 | 200 |   69.366485ms |       10.1.0.65 | POST     "/api/show"
time=2025-08-19T12:08:46.512+02:00 level=INFO source=server.go:135 msg="system memory" total="125.2 GiB" free="117.0 GiB" free_swap="1.6 GiB"
time=2025-08-19T12:08:46.512+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=17 layers.split=9,8 memory.available="[15.3 GiB 15.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="35.0 GiB" memory.required.partial="30.4 GiB" memory.required.kv="1.4 GiB" memory.required.allocations="[15.3 GiB 15.1 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="10.0 GiB" memory.graph.partial="10.0 GiB"
time=2025-08-19T12:08:46.552+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/home/localai/bin/ollama runner --ollama-engine --model /home/localai/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 40960 --batch-size 512 --n-gpu-layers 17 --threads 20 --parallel 5 --tensor-split 9,8 --port 33321"
time=2025-08-19T12:08:46.553+02:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-19T12:08:46.553+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-19T12:08:46.553+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-19T12:08:46.560+02:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-19T12:08:46.560+02:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:33321"
time=2025-08-19T12:08:46.595+02:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /home/localai/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /home/localai/lib/ollama/libggml-cpu-alderlake.so
time=2025-08-19T12:08:46.760+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-19T12:08:46.804+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-19T12:08:46.874+02:00 level=INFO source=ggml.go:365 msg="offloading 17 repeating layers to GPU"
time=2025-08-19T12:08:46.874+02:00 level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-19T12:08:46.874+02:00 level=INFO source=ggml.go:376 msg="offloaded 17/25 layers to GPU"
time=2025-08-19T12:08:46.874+02:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="4.0 GiB"
time=2025-08-19T12:08:46.874+02:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA1 size="3.6 GiB"
time=2025-08-19T12:08:46.874+02:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="5.3 GiB"
time=2025-08-19T12:08:46.934+02:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="10.2 GiB"
time=2025-08-19T12:08:46.934+02:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="10.1 GiB"
time=2025-08-19T12:08:46.934+02:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="10.0 GiB"
time=2025-08-19T12:08:48.828+02:00 level=INFO source=server.go:637 msg="llama runner started in 2.27 seconds"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.11.4

GiteaMirror added the bug label 2026-04-29 06:00:41 -05:00

@rick-github commented on GitHub (Aug 19, 2025):

gpt-oss has a minimum context length, 8192 for machines with more than 20GB. You also have OLLAMA_NUM_PARALLEL=5, so total context is 8192 * 5 = 40960. If you upgrade to the 0.11.5 series, you will benefit from some changes that reduce the size of the gpt-oss allocations.
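In other words, roughly (a sketch of the arithmetic only, not Ollama's actual scheduler code):

# Illustrative only: the requested context is raised to the model minimum,
# then multiplied by the number of parallel slots to size the shared cache.
requested_num_ctx = 3182   # from the API request ('num_ctx')
model_minimum_ctx = 8192   # gpt-oss minimum mentioned above
num_parallel = 5           # OLLAMA_NUM_PARALLEL

per_slot_ctx = max(requested_num_ctx, model_minimum_ctx)  # -> 8192
total_ctx = per_slot_ctx * num_parallel                   # -> 40960, the --ctx-size in the log
print(per_slot_ctx, total_ctx)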


@pdevine commented on GitHub (Aug 19, 2025):

I'm going to go ahead and close the issue. I'd recommend reducing OLLAMA_NUM_PARALLEL unless you need it. You may also want to try OLLAMA_NEW_ESTIMATES=1 to try out the memory optimizations we've included in 0.11.5.
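For example, one way to start the server with those settings (a hypothetical launcher sketch in Python; the equivalent shell exports before running ollama serve work just as well, and the values are illustrations only):

import os
import subprocess

# Hypothetical example: launch the server with a reduced parallel slot count
# and the new memory estimates enabled.
env = dict(
    os.environ,
    OLLAMA_NUM_PARALLEL="1",   # example: one slot instead of five
    OLLAMA_NEW_ESTIMATES="1",  # opt in to the 0.11.5 memory optimizations
)
subprocess.run(["ollama", "serve"], env=env, check=True)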


@StarPet commented on GitHub (Aug 20, 2025):

> I'd recommend reducing OLLAMA_NUM_PARALLEL unless you need it. You may also want to try OLLAMA_NEW_ESTIMATES=1 to try out the memory optimizations we've included in 0.11.5.
Done that. Works fine. Thanks all!
