[GH-ISSUE #5022] GPU VRAM estimate not accounting for flash attention #49690

Closed
opened 2026-04-28 12:43:07 -05:00 by GiteaMirror · 2 comments

Originally created by @theasp on GitHub (Jun 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5022

What is the issue?

Hi,

I'm using a q6_K quant of codestral-22b with an 18k context and flash attention enabled. I'm trying to configure a larger context, but there is always VRAM left over. It appears that the estimate does not account for the use of flash attention, as I still have 2882 MiB free.

NAME                            ID              SIZE    PROCESSOR       UNTIL
DEFAULT/codestral-22b:latest    cd78ecba62ae    25 GB   100% GPU        Forever
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   40C    P8             34W /  420W |   21694MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3659350      C   ...unners/cuda_v11/ollama_llama_server      21684MiB |
+-----------------------------------------------------------------------------------------+
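As a quick sanity check (a sketch only; the values are read straight off the nvidia-smi Memory-Usage column above), the headroom on the card matches the 2882 MiB figure quoted in the description:

```python
# Unused VRAM implied by the nvidia-smi Memory-Usage column above.
total_mib = 24576  # RTX 3090 total
used_mib = 21694   # reported usage while the model is loaded
print(total_mib - used_mib, "MiB free")  # 2882 MiB (~2.8 GiB) left unused
```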
ollama-1  | time=2024-06-13T13:02:05.501Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=57 memory.available="23.3 GiB" memory.required.full="23.3 GiB" memory.required.partial="23.3 GiB" memory.required.kv="3.9 GiB" memory.weights.total="16.8 GiB" memory.weights.repeating="16.7 GiB" memory.weights.nonrepeating="157.5 MiB" memory.graph.full="1.8 GiB" memory.graph.partial="1.8 GiB"
ollama-1  | time=2024-06-13T13:02:05.501Z level=INFO source=server.go:341 msg="starting llama server" cmd="/tmp/ollama151144087/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-83d371fdab7d62c12eb780a034bf9b5ea89403e4d69e46d332d9bdaeff765c31 --ctx-size 18432 --batch-size 512 --embedding --log-disable --n-gpu-layers 57 --flash-attn --parallel 1 --port 35209"
[...]
time=2024-06-13T13:02:05.953Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server not responding"
llm_load_tensors: offloading 56 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 57/57 layers to GPU
llm_load_tensors:        CPU buffer size =   157.50 MiB
llm_load_tensors:      CUDA0 buffer size = 17248.90 MiB
time=2024-06-13T13:02:06.656Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 18432
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  4032.00 MiB
llama_new_context_with_model: KV self size  = 4032.00 MiB, K (f16): 2016.00 MiB, V (f16): 2016.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.15 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   130.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    48.01 MiB
llama_new_context_with_model: graph nodes  = 1575
llama_new_context_with_model: graph splits = 2
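Putting the log lines together (a back-of-the-envelope sketch; the figures are copied from the memory.go estimate and the llama.cpp allocation lines above, and the interpretation of the gap is an assumption, not Ollama's actual accounting): the estimator budgets a 1.8 GiB compute graph, but with --flash-attn llama.cpp only allocates a 130 MiB compute buffer, which accounts for most of the unused VRAM.

```python
MIB = 1024**2
GIB = 1024**3

# What the estimator budgeted (memory.go:133 "offload to gpu" line above)
est_weights = 16.8 * GIB  # memory.weights.total
est_kv = 3.9 * GIB        # memory.required.kv
est_graph = 1.8 * GIB     # memory.graph.full (sized as if flash attention were off)

# What llama.cpp actually allocated on CUDA0 with --flash-attn
act_weights = 17248.90 * MIB  # CUDA0 buffer size
act_kv = 4032.00 * MIB        # CUDA0 KV buffer size
act_graph = 130.00 * MIB      # CUDA0 compute buffer size

estimated = est_weights + est_kv + est_graph
actual = act_weights + act_kv + act_graph
print(f"estimated {estimated / GIB:.1f} GiB, actual {actual / GIB:.1f} GiB, "
      f"gap {(estimated - actual) / GIB:.1f} GiB")
# -> estimated 22.5 GiB, actual 20.9 GiB, gap ~1.6 GiB, almost all of it the graph buffer
```

On top of that gap, memory.required.full is reported as 23.3 GiB rather than the 22.5 GiB sum of the components, presumably per-GPU overhead padding, which is why nvidia-smi still shows roughly 2.9 GiB free even though the model was sized to fill the card.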

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.43

GiteaMirror added the bug label 2026-04-28 12:43:07 -05:00

@hdnh2006 commented on GitHub (Oct 18, 2024):

I am facing a very similar problem, but with another model (Qwen).

I have dropped a comment here: https://github.com/ollama/ollama/issues/3078#issuecomment-2421960239.

Any help getting Ollama to use 100% of the GPU would be fantastic.


@jessegross commented on GitHub (Sep 24, 2025):

I'm going to go ahead and close this now that the new memory management logic is on by default. If you continue to see problems, please file a new issue.
