[GH-ISSUE #8734] Out of Memory Error #5666

Closed
opened 2026-04-12 16:57:46 -05:00 by GiteaMirror · 11 comments

Originally created by @ghostplant on GitHub (Jan 31, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8734

What is the issue?

I am running deepseek-r1:671b on a single GPU (180GB), and it reports errors like this:

[GIN] 2025/01/31 - 20:48:18 | 200 |     117.056µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/01/31 - 20:48:18 | 200 |   17.354684ms |       127.0.0.1 | POST     "/api/show"
time=2025-01-31T20:48:19.586Z level=INFO source=server.go:104 msg="system memory" total="1690.2 GiB" free="358.4 GiB" free_swap="7.8 GiB"
time=2025-01-31T20:48:19.587Z level=WARN source=server.go:136 msg="model request too large for system" requested="391.0 GiB" available=393216983040 total="1690.2 GiB" free="358.4 GiB" swap="7.8 GiB"
time=2025-01-31T20:48:19.587Z level=INFO source=sched.go:428 msg="NewLlamaServer failed" model=/root/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 error="model requires more system memory (391.0 GiB) than is available (366.2 GiB)"
[GIN] 2025/01/31 - 20:48:19 | 500 |  1.394289898s |       127.0.0.1 | POST     "/api/generate"
Error: model requires more system memory (391.0 GiB) than is available (366.2 GiB)

Is this expected? Or do I need to configure something else so that the large model offloads only part of its data to the GPU?

OS

Linux

GPU

Nvidia

CPU

Other

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-12 16:57:46 -05:00

@rick-github commented on GitHub (Jan 31, 2025):

model requires more system memory (391.0 GiB) than is available (366.2 GiB)

Add 40G of swap or close some programs and free up some of the 1690.2 GiB of RAM.
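
For reference, the size of the shortfall can be read straight off the log. The sketch below is not part of Ollama: `parse_memory_error` is a hypothetical helper that pulls the two figures out of the error text, and the remaining lines cross-check them against the byte count and the free/free_swap values from the WARN and INFO lines above.

```python
import re

GIB = 1024 ** 3  # bytes per GiB


def parse_memory_error(message: str) -> tuple[float, float]:
    """Return (required_gib, available_gib) parsed from the error text.

    Hypothetical helper for illustration; it only understands the exact
    wording of the error shown in this thread.
    """
    match = re.search(
        r"requires more system memory \(([\d.]+) GiB\) "
        r"than is available \(([\d.]+) GiB\)",
        message,
    )
    if match is None:
        raise ValueError("message does not match the expected error format")
    return float(match.group(1)), float(match.group(2))


error = (
    "Error: model requires more system memory (391.0 GiB) "
    "than is available (366.2 GiB)"
)
required, available = parse_memory_error(error)
shortfall = required - available

# The WARN line reports the same "available" figure in raw bytes.
available_bytes = 393216983040

print(f"available from bytes: {available_bytes / GIB:.1f} GiB")
print(f"free + free_swap:     {358.4 + 7.8:.1f} GiB")
print(f"shortfall:            {shortfall:.1f} GiB")
```

By these numbers the request is about 24.8 GiB short, so 40G of extra swap would cover it with some headroom.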


@ghostplant commented on GitHub (Jan 31, 2025):

Thanks. Which exact command arguments should I use? And is it for ollama serve -xxx or ollama run -xxx?


@rick-github commented on GitHub (Jan 31, 2025):

Add swap.

# dd if=/dev/zero of=/new-swap-file bs=1G count=40
# mkswap /new-swap-file
# swapon /new-swap-file

If you don't know what the above commands do and the potential impact on your system, don't run them. Reach out to your sysadmin.

Close programs. You have ~1.7T of RAM and 1.3T is being used. Identify which programs are using the RAM and close them.

# ps ax -o 'pid,%mem,cmd' --sort '-%mem' | head
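
Before retrying, it may help to check whether free RAM plus free swap now covers the request. This is a minimal sketch, assuming the `free` and `free_swap` figures in the log correspond roughly to the `MemAvailable` and `SwapFree` fields of /proc/meminfo; that mapping is an assumption, not something the log confirms. The helper names are hypothetical, and the sample values are made up to approximate the figures in the log.

```python
GIB = 1024 ** 3  # bytes per GiB

# Illustrative sample only, chosen to approximate the log above
# (about 358.4 GiB available RAM and 7.8 GiB free swap).
SAMPLE_MEMINFO = """\
MemTotal:       1772303155 kB
MemAvailable:    375809638 kB
SwapTotal:         8178892 kB
SwapFree:          8178892 kB
HugePages_Total:         0
"""


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in bytes}."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue  # skip blank or malformed lines
        amount = int(parts[0])
        # Sizes are reported in kB (1024 bytes); a few fields are plain counts.
        if len(parts) > 1 and parts[1] == "kB":
            amount *= 1024
        values[name.strip()] = amount
    return values


def fits_in_system_memory(required_gib: float, meminfo: dict[str, int]) -> bool:
    """Compare a requirement in GiB against available RAM plus free swap."""
    available = meminfo["MemAvailable"] + meminfo["SwapFree"]
    return required_gib * GIB <= available


info = parse_meminfo(SAMPLE_MEMINFO)
available_gib = (info["MemAvailable"] + info["SwapFree"]) / GIB
print(f"available: {available_gib:.1f} GiB")
print("391.0 GiB fits:", fits_in_system_memory(391.0, info))
```

On a Linux host, passing the contents of /proc/meminfo instead of the sample text gives the live figures.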

@ghostplant commented on GitHub (Jan 31, 2025):

Oh, I am actually using a GPU (180GB) to accelerate it, and the host has plenty of CPU memory (2T). The 671b model is 400GB, so it shouldn't need disk swap. I am just wondering how to split the model so that part of it sits on the GPU and part on the CPU?


@rick-github commented on GitHub (Jan 31, 2025):

Ollama will split the model automatically. It will load what it can into GPU VRAM, and the rest will be in system RAM.


@rick-github commented on GitHub (Feb 1, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will reveal details.


@gkraker04 commented on GitHub (Feb 5, 2025):

I'm running into a similar issue, just with different proportions. I'm running Unsloth's 1.58-bit quant as we speak, not through Ollama as I would like, but through llama-server directly.

Ollama's log:
time=2025-02-04T20:24:48.989-06:00 level=INFO source=sched.go:428 msg="NewLlamaServer failed" model=g:\models\blobs\sha256-a542caee8df72af41ad48d75b94adacb5fbc61856930460bd599d835400fb3b6 error="model requires more system memory (126.7 GiB) than is available (16.3 GiB)"

The thing is, llama-server doesn't complain. IMO, failing the "system memory required" check in server.go line 124 should just warn, not fail.

On Windows:
llama-server --model G:\models\DeepSeek-R1-UD-IQ1_S\DeepSeek-R1-UD-IQ1_S.gguf --port 10000 --cache-type-k q4_0 --threads 6 --prio 2 --temp 0.6 --min-p 0.1 --ctx-size 8192 --seed 12345 --n-gpu-layers 7


@ghostplant commented on GitHub (Feb 6, 2025):

> failing the "system memory required" check in server.go line 124

Does a small --n-gpu-layers value fix your issue? My current solution is to let some transformer layers and experts use 0bit so that the model narrowly fits on the GPU.


@gkraker04 commented on GitHub (Feb 6, 2025):

No amount of layer offloading helped while using Ollama. It did its calculation and stopped before letting its own version of llama try.


@dimon222 commented on GitHub (Mar 6, 2025):

why was this closed?


@rick-github commented on GitHub (Mar 6, 2025):

Because the question was answered (https://github.com/ollama/ollama/issues/8734#issuecomment-2628454232) and a follow-up request for info (https://github.com/ollama/ollama/issues/8734#issuecomment-2628598765) was not fulfilled.


Reference: github-starred/ollama#5666