[GH-ISSUE #6864] Memory Allocation on VRAM when model size is bigger than the size of VRAM #66371

Open
opened 2026-05-04 03:14:20 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @Gking-a on GitHub (Sep 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6864

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

When num_gpu=22, the model is loaded into VRAM and RAM correctly, but when I change it to num_gpu=24, just one small step up (47 layers in total), I would expect shared VRAM to hold about 1GB of data. It doesn't work: it loads 12GB into shared VRAM and then shuts down due to out of memory. Is this a bug, or is it expected behavior because CUDA doesn't support it?
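
(For reference, not part of the original report: a minimal sketch of setting num_gpu per request through Ollama's local HTTP API; the model name below is a placeholder assumption.)

```python
# Hypothetical sketch: pass num_gpu per request via Ollama's local HTTP API.
# The model name is a placeholder assumption; substitute the model from this report.
import json
import urllib.request

payload = {
    "model": "llama3.1:70b",         # placeholder model name (assumption)
    "prompt": "Hello",
    "stream": False,
    "options": {"num_gpu": 22},      # layers to offload; 24 triggered the crash here
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```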

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 556.13                 Driver Version: 556.13         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   42C    P8             15W / 140W  |      64MiB / 16384MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                                     Usage |
|=========================================================================================|
|    0   N/A  N/A           14768    C+G   D:\Program\QQNT\QQ.exe                     N/A |
+-----------------------------------------------------------------------------------------+

server.log is attached.

Screenshot 2024-09-19 100805: https://github.com/user-attachments/assets/cbe56a81-41db-4ca5-85cb-724d70c93f6b
Screenshot 2024-09-19 100817: https://github.com/user-attachments/assets/e4d516cb-be42-483e-b7ce-1f574a79487f
Screenshot 2024-09-19 100100: https://github.com/user-attachments/assets/df9447a5-500b-4b2d-9b56-bdabd90fcbb9
server.log: https://github.com/user-attachments/files/17052524/server.log

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.10

GiteaMirror added the memory, windows, bug labels 2026-05-04 03:14:21 -05:00
Author
Owner

@Willian7004 commented on GitHub (Sep 19, 2024):

My NVIDIA GPU also doesn't work on v0.3.11; Ollama used the integrated graphics on my AMD CPU instead. That makes models larger than my VRAM run faster on my computer, but it makes models smaller than my VRAM run slower.

Author
Owner

@Willian7004 commented on GitHub (Sep 19, 2024):

> My NVIDIA GPU also doesn't work on v0.3.11; Ollama used the integrated graphics on my AMD CPU instead. That makes models larger than my VRAM run faster on my computer, but it makes models smaller than my VRAM run slower.

I checked again and found it actually runs on the CPU, and the log contains "WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support". Maybe an error occurred while updating to v0.3.11.

Author
Owner

@lee-b commented on GitHub (Sep 20, 2024):

This may be https://old.reddit.com/r/buildapc/comments/dop2fz/windows_10_reserves_20_of_gpu_vram_is_there_a_way/

Author
Owner

@dhiltgen commented on GitHub (Sep 21, 2024):

@Gking-a It looks like many of the crashes in your log are host allocation failures.

ggml_cuda_host_malloc: failed to allocate 16736.15 MiB of pinned memory: out of memory

Windows isn't able to allocate a ~16G buffer. According to the logs, you have ~21G free, so I'm not sure exactly what's going on.
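
(Side note, not part of the original comment: a rough standalone diagnostic for whether Windows can hand out a pinned host buffer of roughly that size, assuming a CUDA-enabled PyTorch install.)

```python
# Diagnostic sketch (assumption: CUDA-enabled PyTorch is installed):
# try to allocate a pinned (page-locked) host buffer of roughly the size
# that ggml_cuda_host_malloc reported failing on.
import torch

size_bytes = int(16736.15 * 1024 * 1024)   # ~16736 MiB, matching the log line
try:
    buf = torch.empty(size_bytes, dtype=torch.uint8, pin_memory=True)
    print(f"Pinned allocation of {size_bytes / 2**20:.0f} MiB succeeded")
except RuntimeError as err:
    print(f"Pinned allocation failed: {err}")
```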

@Willian7004 failure to discover your GPU is most likely unrelated to this issue. If you haven't resolved it already, I'd suggest opening a new issue so we can help troubleshoot. Please include the server log when you do.

Author
Owner

@Gking-a commented on GitHub (Sep 21, 2024):

> This may be https://old.reddit.com/r/buildapc/comments/dop2fz/windows_10_reserves_20_of_gpu_vram_is_there_a_way/

Thanks, but it failed. I tried the regedit tweak; it almost worked, but it still shut down in the end. This time the allocation seems to have changed. When I checked the log, it was "CUDA error: out of memory" again. I also tried using the integrated graphics for display to make sure the NVIDIA GPU was free; that just changed the 12GB in shared VRAM to 14GB.

Author
Owner

@Gking-a commented on GitHub (Sep 21, 2024):

Maybe CUDA can only use the GPU's VRAM and can't use shared RAM? Can someone explain this?

Author
Owner

@dhiltgen commented on GitHub (Sep 24, 2024):

@Gking-a ollama uses dedicated VRAM on GPUs for performance reasons; however, Windows has a VRAM paging model to support overcommitting video memory. We try to avoid that, as performance is negatively impacted. I'm not sure which settings you're changing, but the crashes I noticed in your logs were system memory allocations failing, so this might be related to assigning more system memory to your integrated GPU.
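
(For context, not part of the original comment: a minimal sketch, assuming a CUDA-enabled PyTorch install, of reading free vs. total dedicated VRAM before choosing num_gpu, so the offloaded layers stay inside dedicated memory rather than spilling into shared memory.)

```python
# Sketch (assumption: CUDA-enabled PyTorch): report free vs. total dedicated
# VRAM, as a rough guide for how many layers can stay in dedicated memory.
import torch

free_b, total_b = torch.cuda.mem_get_info(0)
print(f"Dedicated VRAM: {free_b / 2**30:.1f} GiB free of {total_b / 2**30:.1f} GiB")
```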

Reference: github-starred/ollama#66371