[GH-ISSUE #4790] command-r:35b uses too much memory #28779

Open
opened 2026-04-22 07:18:35 -05:00 by GiteaMirror · 4 comments

Originally created by @Zig1375 on GitHub (Jun 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4790

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

My PC configuration is:

  • GPU - Nvidia RTX 4070 (12 GB)

  • 64 GB RAM

  • When I am not using Ollama: 11.9 GB RAM is used

  • When I use Ollama with the default settings: 33.7 GB RAM is used

  • num_ctx = 4k (4,096), then 35.1 GB RAM is used

  • num_ctx = 8k (8,192), then 39.9 GB RAM is used

  • num_ctx = 12k (12,288), then 44.2 GB RAM is used

  • num_ctx = 32k (32,768), then 63.6 GB RAM is used (ALL memory is used)

The actual context sent to Ollama is only about 6k!
Even though this model supports a context of up to 128k, I'm unable to use even a 32k one. I'm not sure if this is a real bug, but it doesn't seem right to me that a 32k context would use 12 GB of GPU RAM and 64 GB of my PC's RAM.
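
For anyone reproducing this, num_ctx can be set per request through the API options, for example (a minimal sketch; the prompt is only a placeholder and not necessarily what was used for the measurements above):

curl http://localhost:11434/api/generate -d '{
  "model": "command-r:35b",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {"num_ctx": 32768}
}'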

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.1.41

GiteaMirror added the memory and bug labels 2026-04-22 07:18:36 -05:00

@kozuch commented on GitHub (Jun 13, 2024):

My GPU runs out of memory with a larger context too (https://github.com/ollama/ollama/issues/4985). It would be interesting to debug this to see whether there is a bug in Ollama or whether a larger context simply needs more memory for inference; I have no clue myself.
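
For a rough sense of scale (a back-of-the-envelope sketch, not a measurement): assuming the published Command R configuration (40 layers, 64 KV heads, head dim 128, no GQA), a plain fp16 KV cache at a 32k context works out to roughly 40 GiB by itself, which is in the same ballpark as the growth reported above:

# Back-of-the-envelope fp16 KV-cache size for command-r at num_ctx=32768:
# 2 (K and V) * 40 layers * 64 KV heads * 128 head dim * 2 bytes * 32768 tokens.
# The layer/head counts are the published Command R config, not read from the GGUF.
echo $(( 2 * 40 * 64 * 128 * 2 * 32768 ))   # 42949672960 bytes, about 40 GiB

If that estimate is in the right range, most of the growth would be expected KV-cache cost rather than a leak; as far as I understand, the cache is allocated for the full num_ctx up front, which would also explain why memory grows even when the actual prompt is only ~6k.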


@dhiltgen commented on GitHub (Jun 18, 2024):

Can you try disabling mmap to see if that has an impact on system memory consumption in your scenario?

curl http://localhost:11434/api/generate -d '{
  "model": "command-r:35b",
  "prompt": "Why is the sky blue?",
  "stream": false, "options": {"use_mmap": false}
}'
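
Assuming a build recent enough to ship the ps subcommand, the GPU/CPU split for the loaded model can also be checked while the request is running, which helps separate mmap effects from plain KV-cache growth:

# Lists loaded models and how much of each sits in GPU vs system memory
ollama ps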

@Zig1375 commented on GitHub (Jun 19, 2024):

Tested on version 0.1.44.

{
    "model": "command-r",
    "messages": [
        {
            "role": "user",
            "content": "Why is the sky blue?"
        }
    ],
    "options": {
        "num_ctx": 32768,
        "use_mmap": false
    },
    "stream": true
}

The issue is still there; Ollama uses all the available memory.

  • num_ctx = 4k (4,096), then 35.1 GB RAM is used
  • num_ctx = 8k (8,192), then 39.9 GB RAM is used
  • num_ctx = 12k (12,288), then 44.2 GB RAM is used
  • num_ctx = 32k (32,768), then 63.6 GB RAM is used (ALL memory is used)

@kozuch commented on GitHub (Jun 25, 2024):

Looks like the memory usage depends on the model. For phi3-mini-128k with num_ctx=64000, RAM usage is 32 GB (on CPU). Also, the inference speed is completely crippled (1 token per 10 s). With a 32k context, RAM usage is 17 GB and inference still runs pretty fast.
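
That would be consistent with a per-model KV-cache cost: assuming the published phi3-mini config (32 layers, 32 KV heads, head dim 96), an fp16 cache at num_ctx=64000 comes to roughly 23 GiB before the weights are counted. Again just a sketch under those assumed numbers:

# Rough fp16 KV-cache size for phi3-mini at num_ctx=64000:
# 2 (K and V) * 32 layers * 32 KV heads * 96 head dim * 2 bytes * 64000 tokens
# (assumed config, not verified against the GGUF metadata).
echo $(( 2 * 32 * 32 * 96 * 2 * 64000 ))   # 25165824000 bytes, about 23 GiB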


Reference: github-starred/ollama#28779