[GH-ISSUE #12727] Ollama server fails to load big models with Nvidia GPU installed: out of pinned memory #8445

Open
opened 2026-04-12 21:07:35 -05:00 by GiteaMirror · 9 comments

Originally created by @xdnv on GitHub (Oct 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12727

Originally assigned to: @jessegross on GitHub.

What is the issue?

Hi!
The system has an RTX 4090 24GB and 1.1 TB of RAM, and is running Win11 Pro.
Due to OS limitations, the maximum Nvidia pinned memory is only ~600 GB, nearly half of the available RAM. This is a fairly common question on Nvidia's developer forums.

Models that fit within that limit run fine, but bigger ones (e.g. DeepSeek flavours 700+ GB in size) send Ollama into a few unsuccessful load retries. Please see the log output.
They still run fine on the pure CPU runner, though, when no GPU is installed.

It looks like a bug in Ollama's memory calculation: it does not take the actual pinned-memory limit into account and does not fall back to some split/revert logic.
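
(For reference: the CPU model buffer in the log below is 744405.63 MiB, i.e. roughly 727 GiB or ~780 GB, so a single pinned allocation of that size can never fit under the ~600 GB cap mentioned above.)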

Relevant log output

...
time=2025-10-21T23:41:12.833+03:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-21T23:41:12.833+03:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=32 efficiency=0 threads=64
time=2025-10-21T23:41:12.833+03:00 level=INFO source=server.go:505 msg="system memory" total="1151.6 GiB" free="1119.9 GiB" free_swap="1134.6 GiB"
time=2025-10-21T23:41:12.836+03:00 level=INFO source=server.go:545 msg=offload library=CUDA layers.requested=-1 layers.model=62 layers.offload=0 layers.split=[] memory.available="[23.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="734.9 GiB" memory.required.partial="0 B" memory.required.kv="8.1 GiB" memory.required.allocations="[1.5 GiB]" memory.weights.total="725.2 GiB" memory.weights.repeating="723.5 GiB" memory.weights.nonrepeating="1.7 GiB" memory.graph.full="300.3 MiB" memory.graph.partial="1019.5 MiB"
time=2025-10-21T23:41:12.870+03:00 level=INFO source=runner.go:893 msg="starting go runner"
load_backend: loaded CPU backend from %%%\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-%%%
load_backend: loaded CUDA backend from %%%\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2025-10-21T23:41:12.976+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
...
load_tensors: loading model tensors, this can take a while... (mmap = false)
ggml_backend_cuda_device_get_memory utilizing NVML memory reporting free: 24893034496 total: 25757220864
ggml_cuda_host_malloc: failed to allocate 744405.63 MiB of pinned memory: out of memory
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/62 layers to GPU
load_tensors:          CPU model buffer size = 744405.63 MiB

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.12.6

GiteaMirror added the bug label 2026-04-12 21:07:35 -05:00

@rick-github commented on GitHub (Oct 21, 2025):

time=2025-10-21T23:41:12.836+03:00 level=INFO source=server.go:545 msg=offload library=CUDA layers.requested=-1
 layers.model=62 layers.offload=0 layers.split=[] memory.available="[23.2 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="734.9 GiB" memory.required.partial="0 B" memory.required.kv="8.1 GiB"
 memory.required.allocations="[1.5 GiB]" memory.weights.total="725.2 GiB" memory.weights.repeating="723.5 GiB"
 memory.weights.nonrepeating="1.7 GiB" memory.graph.full="300.3 MiB" memory.graph.partial="1019.5 MiB"

The model is not running on CPU because of a lack of pinned memory; it's running on CPU because the estimation logic thought it couldn't fit any layers on the GPU. A full log (not the partial one posted) with OLLAMA_DEBUG=2 might reveal more details. Note that this will log the prompt.
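
A minimal sketch of capturing that on Windows, assuming the server is started from the same PowerShell session (rather than via the tray app):

```shell
# PowerShell: turn on verbose Ollama logging for the next server start.
$env:OLLAMA_DEBUG = "2"
ollama serve
```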


@xdnv commented on GitHub (Oct 21, 2025):

Okay, I'll produce the log with the DEBUG=2 setting. But it's still strange to me that Ollama sticks to the smaller pinned-memory limit even in CPU mode and fails consistently, while twice that amount of pageable memory is available. Maybe I was not clear enough: big models work fine in CPU mode when no pinned memory is available (no GPU installed). And I see no easy way to switch this behaviour on a per-model basis.
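
There is no per-model switch mentioned in this thread, but as a coarse, server-wide workaround that reproduces the "no GPU installed" case, Ollama's documented handling of an invalid GPU ID can be used; a sketch, again assuming a PowerShell session:

```shell
# PowerShell: hide the CUDA device from Ollama via an invalid GPU ID so the
# server behaves as in the CPU-only case described above.
# Note: this applies to the whole server, not to a single model.
$env:CUDA_VISIBLE_DEVICES = "-1"
ollama serve
```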


@xdnv commented on GitHub (Oct 21, 2025):

@rick-github Here is the log file with DEBUG=2. There were multiple attempts to load the model.
server_.log


@jessegross commented on GitHub (Oct 21, 2025):

Deepseek currently defaults to running on the old llama engine. That engine tries to allocate CUDA host memory if there is a CUDA-capable GPU installed. The Ollama engine does not require CUDA host memory, so it presumably should not have this issue. There is a Deepseek implementation for it that you can test out by setting OLLAMA_NEW_ENGINE=1.
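
For reference, a sketch of setting that on Windows, with the same assumption as above about starting the server from a PowerShell session:

```shell
# PowerShell: opt in to the new Ollama engine for the next server start,
# then retry loading the model.
$env:OLLAMA_NEW_ENGINE = "1"
ollama serve
```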


@xdnv commented on GitHub (Oct 22, 2025):

Deepseek currently defaults to running on the old llama engine. That engine tries to allocate CUDA host memory if there is a CUDA-capable GPU installed. The Ollama engine does not require CUDA host memory so presumably should not have this issue. There is a Deepseek implementation for it that you can test out by setting OLLAMA_NEW_ENGINE=1.

Thank you. With OLLAMA_NEW_ENGINE=1 it fails right after the query is sent, with the following assert:

ggml.c:1648: GGML_ASSERT(view_src == NULL || data_size == 0 || data_size + view_offs <= ggml_nbytes(view_src)) failed
time=2025-10-22T09:21:54.457+03:00 level=INFO source=sched.go:450 msg="Load failed" model=%%%\blobs\sha256-2cf3c7aa2faac9b5aabeb7ea75b481d599f329a390e95d7d96e205e054464088 error="do load request: Post \"http://127.0.0.1:64355/load\": read tcp 127.0.0.1:64361->127.0.0.1:64355: wsarecv: An existing connection was forcibly closed by the remote host."
time=2025-10-22T09:21:54.457+03:00 level=DEBUG source=server.go:1720 msg="stopping llama server" pid=28400
[GIN] 2025/10/22 - 09:21:54 | 500 |    513.6728ms |             ::1 | POST     "/api/generate"
time=2025-10-22T09:21:54.478+03:00 level=ERROR source=server.go:426 msg="llama runner terminated" error="exit status 0xc0000409"

Here is the full log with DEBUG=2:

server_new-engine_.log


@jessegross commented on GitHub (Oct 22, 2025):

Sorry, I see that you are using Deepseek 3.1 and we don't fully support the latest version in the Ollama engine implementation. That's part of the reason why we haven't switched it over to this version by default yet. Hopefully, this should be coming relatively soon.

CC: @gr4ceG


@xdnv commented on GitHub (Oct 23, 2025):

Sorry, I see that you are using Deepseek 3.1 and we don't fully support the latest version in the Ollama engine implementation. That's part of the reason why we haven't switched it over to this version by default yet. Hopefully, this should be coming relatively soon.

I see the same situation with Deepseek R1 0508, if that adds any useful details.
So, if I understood correctly, the current approach is to find the most stable combination of Ollama release and system settings and wait for the new engine to go live.


@jessegross commented on GitHub (Oct 23, 2025):

I'm not sure what source of Deepseek R1 you are using, but it should work with the version in Ollama's library using the new engine: ollama run deepseek-r1:671b


@xdnv commented on GitHub (Nov 4, 2025):

I temporarily reverted to Ollama 0.12.3; it runs fast and smooth, without asserts or memory problems, and with GPU acceleration enabled.
One of the models tested is DeepSeek-V3.1:UD-Q8_K_XL from unsloth.
Q4 library models give slightly worse results in comparison.

Reference: github-starred/ollama#8445