[GH-ISSUE #15906] granite4.1 models ignoring Ollama default context window size on Ollama 0.22.0 #72191

Open
opened 2026-05-05 03:36:47 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @cwthomas-llu on GitHub (Apr 30, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15906

What is the issue?

I was running granite4.1:30b (17 GB) and noticed it was running slowly on my hardware, given the GPUs I have. When I ran ollama ps, I saw that the model was using 97 GB. I only have roughly 48 GB of VRAM, so the model spilled over to the CPU, which is where the slowness came from. I believe this is because the context size is set to 131072 rather than Ollama's default context size (8K tokens?). I tested this not only with a simple Python app using the ollama library, but also by making sure no models were loaded and then running ollama run granite4.1:30b.

I tested the smaller granite4.1:8b model as well, and this 5.3 GB model was using 55 GB. This is likewise because it uses the whole allowable context window rather than the default size.

It would be nice if the granite4.1 models used the default context size (when none is specified) so that these models can fit on my GPUs.
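As a workaround sketch (assuming the standard Ollama REST API; 8192 is just an illustrative value, not a recommendation), the context window can also be capped per request via the num_ctx option:

$ curl http://localhost:11434/api/generate -d '{
    "model": "granite4.1:30b",
    "prompt": "Why is the sky blue?",
    "stream": false,
    "options": { "num_ctx": 8192 }
  }'

With num_ctx capped this way, the server should allocate a much smaller KV cache, so the model may fit entirely in the 48 GB of VRAM.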

Relevant log output

$ nvidia-smi
Thu Apr 30 11:23:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:41:00.0 Off |                  Off |
|  0%   43C    P8             20W /  450W |      18MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:83:00.0  On |                  Off |
|  0%   48C    P8             18W /  450W |     104MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3741      G   /usr/bin/gnome-shell                            6MiB |
|    1   N/A  N/A      3741      G   /usr/bin/gnome-shell                           66MiB |
|    1   N/A  N/A      3829      G   /usr/bin/Xwayland                               8MiB |
+-----------------------------------------------------------------------------------------+

$ ollama ps
NAME              ID              SIZE     PROCESSOR          CONTEXT    UNTIL
granite4.1:30b    3f3e5df8a021    97 GB    52%/48% CPU/GPU    131072     4 minutes from now


$ ollama ps
NAME             ID              SIZE     PROCESSOR          CONTEXT    UNTIL
granite4.1:8b    444af1c4b2fe    55 GB    15%/85% CPU/GPU    131072     4 minutes from now

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.22.0

GiteaMirror added the bug label 2026-05-05 03:36:48 -05:00
Author
Owner

@rick-github commented on GitHub (May 3, 2026):

Set OLLAMA_CONTEXT_LENGTH.

https://github.com/ollama/ollama/issues/14116
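A minimal sketch of applying that suggestion (8192 is an illustrative value):

$ OLLAMA_CONTEXT_LENGTH=8192 ollama serve

For a systemd-managed install, set the variable in a service override instead:

$ sudo systemctl edit ollama.service
  # add under [Service]:
  # Environment="OLLAMA_CONTEXT_LENGTH=8192"
$ sudo systemctl restart ollama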


Reference: github-starred/ollama#72191