[GH-ISSUE #8910] On Linux, running deepseek-r1:671b with ollama reports an error: Error: llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer / llama_load_model_from_file: failed to load model #52288

Closed
opened 2026-04-28 22:55:03 -05:00 by GiteaMirror · 10 comments
Owner

Originally created by @xinshou-xin on GitHub (Feb 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8910

What is the issue?

(base) root@EDserver:~# ollama -v
ollama version is 0.5.7

(base) root@EDserver:~# ollama run deepseek-r1:671b
Error: llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer
llama_load_model_from_file: failed to load model

Is there any solution?

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-28 22:55:03 -05:00
Author
Owner

@xinshou-xin commented on GitHub (Feb 7, 2025):

![Image](https://github.com/user-attachments/assets/55f9b6c2-e621-451b-b7da-beb584f55af4)
Author
Owner

@ipfgao commented on GitHub (Feb 7, 2025):

Add the following environment variable settings to /etc/systemd/system/ollama.service and restart the service:

```
Environment="OLLAMA_LOAD_TIMEOUT=90m"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
Environment="OLLAMA_GPU_OVERHEAD=536870912"
Environment="OLLAMA_FLASH_ATTENTION=1"
```

After the edit, the unit file looks like this:

```
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin"
Environment="OLLAMA_LOAD_TIMEOUT=90m"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
Environment="OLLAMA_GPU_OVERHEAD=536870912"
Environment="OLLAMA_FLASH_ATTENTION=1"

[Install]
WantedBy=default.target
```
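For context on the numbers above: `OLLAMA_GPU_OVERHEAD` is specified in bytes, so the value in this unit file reserves 512 MiB of VRAM headroom. A quick sanity check of that arithmetic (a sketch, not part of the original comment):

```python
# OLLAMA_GPU_OVERHEAD is given in bytes; 536870912 bytes is exactly
# 512 MiB (0.5 GiB) of VRAM that Ollama keeps free per GPU.
overhead_bytes = 536870912
print(overhead_bytes / 2**20, "MiB")  # 512.0 MiB
print(overhead_bytes / 2**30, "GiB")  # 0.5 GiB
```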
Author
Owner

@rick-github commented on GitHub (Feb 7, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
Author
Owner

@mikestut commented on GitHub (Feb 7, 2025):

The deepseek-r1:671b model is about 404 GB. To hold it entirely in VRAM you would need roughly 400 GB of GPU memory, but you only have 40 GB.
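The rough arithmetic behind this comment, as a sketch (sizes are approximate and ignore KV cache and per-GPU overhead):

```python
import math

# A ~404 GB model spread across 40 GB GPUs needs at least this many
# cards if it is to reside entirely in VRAM.
model_gb = 404
vram_per_gpu_gb = 40
gpus_needed = math.ceil(model_gb / vram_per_gpu_gb)
print(gpus_needed)  # 11
```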
Author
Owner

@rick-github commented on GitHub (Feb 7, 2025):

If the machine has 364G of free memory, the model is loadable. Server logs will show why it didn't load.
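Whether the model fits depends on available system RAM on top of VRAM. A quick Linux-only sketch (not from the thread) for checking that before loading:

```python
# Linux-only: parse /proc/meminfo (values are reported in kB).
# MemAvailable is present on kernels >= 3.14 and is the better
# estimate of memory actually usable for a new workload.
def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # kB
    return info

m = meminfo()
print(f"MemTotal:     {m['MemTotal'] / 2**20:.1f} GiB")
print(f"MemAvailable: {m['MemAvailable'] / 2**20:.1f} GiB")
```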
Author
Owner

@relic-yuexi commented on GitHub (Feb 8, 2025):

> Add the following environment variable settings to /etc/systemd/system/ollama.service and restart the service:
>
> ```
> Environment="OLLAMA_LOAD_TIMEOUT=90m"
> Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
> Environment="OLLAMA_GPU_OVERHEAD=536870912"
> Environment="OLLAMA_FLASH_ATTENTION=1"
> ```
>
> After the edit, the unit file looks like this:
>
> ```
> [Unit]
> Description=Ollama Service
> After=network-online.target
>
> [Service]
> ExecStart=/usr/local/bin/ollama serve
> User=ollama
> Group=ollama
> Restart=always
> RestartSec=3
> Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin"
> Environment="OLLAMA_LOAD_TIMEOUT=90m"
> Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
> Environment="OLLAMA_GPU_OVERHEAD=536870912"
> Environment="OLLAMA_FLASH_ATTENTION=1"
>
> [Install]
> WantedBy=default.target
> ```

It loads now, but the runner became a zombie process and I get:

```
Error: POST predict: Post "http://127.0.0.1:44325/completion": EOF
```
Author
Owner

@SongXiaoMao commented on GitHub (Feb 8, 2025):

> Add the following environment variable settings to /etc/systemd/system/ollama.service and restart the service:
>
> ```
> Environment="OLLAMA_LOAD_TIMEOUT=90m"
> Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
> Environment="OLLAMA_GPU_OVERHEAD=536870912"
> Environment="OLLAMA_FLASH_ATTENTION=1"
> ```
>
> After the edit, the unit file looks like this:
>
> ```
> [Unit]
> Description=Ollama Service
> After=network-online.target
>
> [Service]
> ExecStart=/usr/local/bin/ollama serve
> User=ollama
> Group=ollama
> Restart=always
> RestartSec=3
> Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin"
> Environment="OLLAMA_LOAD_TIMEOUT=90m"
> Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
> Environment="OLLAMA_GPU_OVERHEAD=536870912"
> Environment="OLLAMA_FLASH_ATTENTION=1"
>
> [Install]
> WantedBy=default.target
> ```

OK, thank you very much. But why are my 671b think tags empty?

![Image](https://github.com/user-attachments/assets/256581df-a9d0-4a1c-b38e-5d75670090f4)
Author
Owner

@SongXiaoMao commented on GitHub (Feb 8, 2025):

Removing the line `Environment="OLLAMA_FLASH_ATTENTION=1"` brings the thinking output back.
Author
Owner

@SongXiaoMao commented on GitHub (Feb 8, 2025):

![Image](https://github.com/user-attachments/assets/acb2c0bb-8a15-41a9-8d30-de88a9e0e1b3)

Very strange: only the first question I asked produced any thinking.
Author
Owner

@ant-ob-hengtang commented on GitHub (Mar 5, 2025):

> > Add the following environment variable settings to /etc/systemd/system/ollama.service and restart the service:
> >
> > ```
> > Environment="OLLAMA_LOAD_TIMEOUT=90m"
> > Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
> > Environment="OLLAMA_GPU_OVERHEAD=536870912"
> > Environment="OLLAMA_FLASH_ATTENTION=1"
> > ```
> >
> > After the edit, the unit file looks like this:
> >
> > ```
> > [Unit]
> > Description=Ollama Service
> > After=network-online.target
> >
> > [Service]
> > ExecStart=/usr/local/bin/ollama serve
> > User=ollama
> > Group=ollama
> > Restart=always
> > RestartSec=3
> > Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin"
> > Environment="OLLAMA_LOAD_TIMEOUT=90m"
> > Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
> > Environment="OLLAMA_GPU_OVERHEAD=536870912"
> > Environment="OLLAMA_FLASH_ATTENTION=1"
> >
> > [Install]
> > WantedBy=default.target
> > ```
>
> It loads now, but the runner became a zombie process and I get:
>
> ```
> Error: POST predict: Post "http://127.0.0.1:44325/completion": EOF
> ```

I'm running into the same error. Did you manage to solve it, and if so, how?
Reference: github-starred/ollama#52288