[GH-ISSUE #8872] After a period of conversation with DeepSeek R1, the memory of the server keeps decreasing and never release #52261

Closed
opened 2026-04-28 22:41:58 -05:00 by GiteaMirror · 1 comment

Originally created by @gallery2016 on GitHub (Feb 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8872

What is the issue?

Thank you for your attention.

Below are the details of my environment and configuration:

Ollama client version is 0.5.7

CentOS Linux release 7.9.2009 (Core)

The GPU is: 8 * L40

Driver Version: 550.54.15 CUDA Version: 12.4

Ollama startup configuration:
export OLLAMA_LOAD_TIMEOUT=90m
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
export OLLAMA_GPU_OVERHEAD=536870912
export OLLAMA_FLASH_ATTENTION=1

ollama serve
ollama run deepseek-r1:671b

The total GPU memory is 48G * 8 = 384G. That is not enough to run the Q4 model of 671B with Ollama, so I set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 so that it also uses the server's system memory. With that setting, everything works fine at startup.
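The arithmetic above can be sketched in shell. This is only an illustration under stated assumptions: the helper name `usable_vram_mib` is hypothetical, the "G" figures are treated as GiB, and OLLAMA_GPU_OVERHEAD is assumed to be reserved once per GPU.

```shell
#!/bin/sh
# Hypothetical helper: estimate the VRAM left for the model after the
# configured overhead is set aside.
# Assumptions: sizes are GiB, and the overhead is reserved once per GPU.
usable_vram_mib() {
    gpus="$1"            # number of GPUs
    vram_gib="$2"        # VRAM per GPU, in GiB
    overhead_bytes="$3"  # value of OLLAMA_GPU_OVERHEAD, in bytes

    overhead_mib=$((overhead_bytes / 1024 / 1024))
    total_mib=$((gpus * vram_gib * 1024))
    echo $((total_mib - gpus * overhead_mib))
}

# Values from this report: 8 GPUs, 48G each, overhead of 536870912 bytes (512 MiB).
usable_vram_mib 8 48 536870912   # prints 389120 (MiB), i.e. 380 GiB
```

Under these assumptions, any part of the model and its context that does not fit in roughly 380 GiB would have to live in host memory once unified memory is enabled.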

However, after a period of conversation, the server's free memory keeps decreasing until it runs out and is never released, and by the last conversation responses are very laggy.
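One way to put numbers on this symptom is to sample host memory and swap at a fixed interval while chatting with the model. The sketch below is not part of Ollama; the function names `mem_snapshot` and `log_memory` are hypothetical, and it assumes a Linux kernel whose `/proc/meminfo` exposes the `MemAvailable` and `SwapFree` fields.

```shell
#!/bin/sh
# Print "MemAvailable_kB SwapFree_kB" read from a meminfo-style file.
# Defaults to /proc/meminfo; a different path can be passed for testing.
mem_snapshot() {
    awk '/^MemAvailable:/ {avail=$2}
         /^SwapFree:/     {swap=$2}
         END              {print avail, swap}' "${1:-/proc/meminfo}"
}

# Emit one timestamped sample every N seconds (default 60) until interrupted.
log_memory() {
    interval="${1:-60}"
    while :; do
        printf '%s %s\n' "$(date +%FT%T)" "$(mem_snapshot)"
        sleep "$interval"
    done
}
```

Running `log_memory 60 >> mem.log` in a second terminal during a long conversation would show whether available memory falls steadily and whether swap is being consumed.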

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the bug and needs more info labels 2026-04-28 22:42:06 -05:00

@rick-github commented on GitHub (Feb 6, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.

What do you mean by "runs out"? Does the runner crash? Does your system fill up the swap device?
Reference: github-starred/ollama#52261