[GH-ISSUE #14332] GPU memory regression when upgrading #55835

Closed
opened 2026-04-29 09:47:23 -05:00 by GiteaMirror · 2 comments

Originally created by @bdrosen96 on GitHub (Feb 20, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14332

What is the issue?

When I upgraded recently, I found that a model that was previously 100% on the GPU was now partially on the GPU and partially on the CPU. When I tried different versions, I found that the last working version was 0.15.5 and the first broken one was 0.15.6.

Working shows:

> ollama ps
NAME                                                           ID              SIZE     PROCESSOR    CONTEXT    UNTIL              
hf.co/bartowski/TheDrummer_Behemoth-X-123B-v2.1-GGUF:Q5_K_M    2e83eeb2fe7f    88 GB    100% GPU     4096       4 minutes from now    

> nvidia-smi 
Fri Feb 20 08:22:25 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01              Driver Version: 590.44.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:01:00.0 Off |                  Off |
| 30%   38C    P8              5W /  300W |   84635MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 4500 Blac...    On  |   00000000:08:00.0 Off |                  Off |
| 30%   27C    P8              8W /  200W |      17MiB /  32623MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3489      G   /usr/bin/gnome-shell                      8MiB |
|    0   N/A  N/A         4064980      C   /mnt/data/local/bin/ollama            84608MiB |
|    1   N/A  N/A            3489      G   /usr/bin/gnome-shell                      3MiB |
+-----------------------------------------------------------------------------------------+

Broken shows:

> ollama ps
NAME                                                           ID              SIZE      PROCESSOR          CONTEXT    UNTIL              
hf.co/bartowski/TheDrummer_Behemoth-X-123B-v2.1-GGUF:Q5_K_M    2e83eeb2fe7f    186 GB    30%/70% CPU/GPU    131072     4 minutes from now    

> nvidia-smi 
Fri Feb 20 09:00:45 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01              Driver Version: 590.44.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    On  |   00000000:01:00.0 Off |                  Off |
| 30%   52C    P1             80W /  300W |   71025MiB /  97887MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 4500 Blac...    On  |   00000000:08:00.0 Off |                  Off |
| 30%   33C    P1             30W /  200W |    5177MiB /  32623MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3489      G   /usr/bin/gnome-shell                      8MiB |
|    0   N/A  N/A         4074154      C   /mnt/data/local/bin/ollama            70930MiB |
|    1   N/A  N/A            3489      G   /usr/bin/gnome-shell                      3MiB |
|    1   N/A  N/A         4074154      C   /mnt/data/local/bin/ollama             5086MiB |
+-----------------------------------------------------------------------------------------+

Originally I thought that this was related to the change in 0.15.5:

Ollama will now default to the following context lengths based on VRAM:
< 24 GiB VRAM: 4,096 context
24-48 GiB VRAM: 32,768 context
>= 48 GiB VRAM: 262,144 context

except the regression only appeared in 0.15.6. In the working case it runs with n_ctx 4096, versus n_ctx 131072 in the broken case. So something in 0.15.6 causes it to try the larger context supported by the model instead of sticking with the smaller one that did not need the CPU.
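
A workaround that keeps the old behavior would be to pin the context length explicitly rather than letting the server pick one from the available VRAM (a sketch, assuming the OLLAMA_CONTEXT_LENGTH environment variable and the num_ctx request option still behave as documented):

# Pin the server-wide default context length before starting the server
OLLAMA_CONTEXT_LENGTH=4096 ollama serve

# Or pin it per request through the API options
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/bartowski/TheDrummer_Behemoth-X-123B-v2.1-GGUF:Q5_K_M",
  "prompt": "hello",
  "options": { "num_ctx": 4096 }
}'

With the context pinned at 4096 I would expect the load to match the working output above and stay 100% on the GPU.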

Relevant log output

In the logs for the working version I can see:


 load_tensors: loading model tensors, this can take a while... (mmap = false)
 load_tensors: offloading 88 repeating layers to GPU
 load_tensors: offloading output layer to GPU
 load_tensors: offloaded 89/89 layers to GPU
 load_tensors:          CPU model buffer size =   264.00 MiB
 load_tensors:        CUDA0 model buffer size = 82216.92 MiB
 llama_context: constructing llama_context
 llama_context: n_seq_max     = 1
 llama_context: n_ctx         = 4096
 llama_context: n_ctx_seq     = 4096
 llama_context: n_batch       = 512
 llama_context: n_ubatch      = 512
 llama_context: causal_attn   = 1
 llama_context: flash_attn    = auto
 llama_context: kv_unified    = false
 llama_context: freq_base     = 1000000.0
 llama_context: freq_scale    = 1
 llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
 llama_context:  CUDA_Host  output buffer size =     0.17 MiB
 llama_kv_cache:      CUDA0 KV buffer size =  1408.00 MiB
 llama_kv_cache: size = 1408.00 MiB (  4096 cells,  88 layers,  1/1 seqs), K (f16):  704.00 MiB, V (f16):  704.00 MiB
 llama_context: Flash Attention was auto, set to enabled
 llama_context:      CUDA0 compute buffer size =   252.01 MiB
 llama_context:  CUDA_Host compute buffer size =    32.01 MiB
 llama_context: graph nodes  = 2735
 llama_context: graph splits = 2



In the broken version I see that same output several times, but at the end I also see:


 load_tensors: loading model tensors, this can take a while... (mmap = false)
 load_tensors: offloading 51 repeating layers to GPU
 load_tensors: offloaded 51/89 layers to GPU
 load_tensors:          CPU model buffer size =   264.00 MiB
 load_tensors:        CUDA0 model buffer size = 44581.31 MiB
 load_tensors:        CUDA1 model buffer size =  2861.44 MiB
 load_tensors:    CUDA_Host model buffer size = 34774.17 MiB
 llama_context: constructing llama_context
 llama_context: n_seq_max     = 1
 llama_context: n_ctx         = 131072
 llama_context: n_ctx_seq     = 131072
 llama_context: n_batch       = 512
 llama_context: n_ubatch      = 512
 llama_context: causal_attn   = 1
 llama_context: flash_attn    = auto
 llama_context: kv_unified    = false
 llama_context: freq_base     = 1000000.0
 llama_context: freq_scale    = 1
 llama_context:        CPU  output buffer size =     0.17 MiB
 llama_kv_cache:        CPU KV buffer size = 18944.00 MiB
 llama_kv_cache:      CUDA0 KV buffer size = 24576.00 MiB
 llama_kv_cache:      CUDA1 KV buffer size =  1536.00 MiB
 llama_kv_cache: size = 45056.00 MiB (131072 cells,  88 layers,  1/1 seqs), K (f16): 22528.00 MiB, V (f16): 22528.00 MiB
 llama_context: Flash Attention was auto, set to enabled
 llama_context:      CUDA0 compute buffer size =  1043.00 MiB
 llama_context:      CUDA1 compute buffer size =   344.01 MiB
 llama_context:  CUDA_Host compute buffer size =   280.01 MiB
 llama_context: graph nodes  = 2735
 llama_context: graph splits = 412 (with bs=512), 4 (with bs=1)
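
As a sanity check on why the larger context forces CPU offload, the KV cache in the two logs scales linearly with the context length (my own back-of-the-envelope arithmetic, using only the figures quoted above):

   4096 cells  ->              1408 MiB KV cache   (working log)
 131072 cells  ->  32 x 1408 = 45056 MiB KV cache  (broken log)

Roughly 80 GiB of weights plus 44 GiB of KV cache is about 124 GiB, which leaves only a few GiB of headroom across the 96 GiB + 32 GiB of VRAM, presumably why the scheduler falls back to a partial CPU offload.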

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.15.6

GiteaMirror added the bug label 2026-04-29 09:47:23 -05:00

@rick-github commented on GitHub (Feb 20, 2026):

#14116

@rick-github commented on GitHub (Feb 20, 2026):

> Upgrading to 0.15.6 makes Ollama offload part of large GGUF models to CPU, splitting the context and killing GPU performance.

Because the context is larger, as discussed in #14116.

> Force the model to stay on GPU by reinstalling the last known good version

What?

> Reference: related PR #4523.

Double what? Why are you referencing a bug that affects v0.1.38?
