[GH-ISSUE #6095] Keeps switching between cached and wired memory #3812

Open
opened 2026-04-12 14:38:52 -05:00 by GiteaMirror · 5 comments

Originally created by @chigkim on GitHub (Jul 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6095

What is the issue?

I offloaded 47 out of 127 layers of Llama 3.1 405b q2 on an M3 Max with 64GB of RAM.
When I run the inference, the memory usage shows only about 8GB, while the cached memory is 56GB. This state persists most of the time, likely indicating that the CPU is in use and data is streaming directly from the disk.
Occasionally, the cached memory decreases, and the wired memory and memory usage increase, suggesting that the GPU is being utilized. Then, the memory usage drops back down to 8GB with the cached memory size at 56GB, repeating the cycle.
Shouldn't the 47 layers be kept in wired memory at all times instead of cached to avoid the constant switching between cached and wired memory? It takes a while to transfer between cached and wired.
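For anyone who wants to watch this outside Activity Monitor, a quick way (purely illustrative, using the stock macOS `vm_stat` tool) is:

```shell
# Values are page counts; run plain `vm_stat` once to see the page size (usually 16384 bytes on Apple silicon)
vm_stat | grep -E "wired down|File-backed"
# Re-run in a loop to watch the two values trade places during inference
while true; do vm_stat | grep -E "wired down|File-backed"; sleep 5; done
```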

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

v0.3.0

GiteaMirror added the bug label 2026-04-12 14:38:52 -05:00

@igorschlum commented on GitHub (Aug 10, 2024):

Hi @chigkim,

It sounds like you're facing some challenges with running the Llama3.1:405b_q2 model on your Mac with 64GB of RAM. Based on the requirements for this model, you would typically need more memory, specifically around 151GB for GPU and additional memory for the display GPU, which suggests that the 192GB version of the MacStudio would be more appropriate.

I noticed you mentioned managing to run part of the model. I’m curious about how you’ve managed to only load certain layers of the model using Ollama on macOS. Specifically:

How did you determine which layers to load and execute? Did you use any specific criteria or methodology to select these layers?
Could you share more about the approach or tools you used to load only these specific layers? Understanding your strategy might help others.

Looking forward to hearing more about your approach!


@chigkim commented on GitHub (Aug 10, 2024):

> Hi @chigkim,
>
> It sounds like you're facing some challenges with running the Llama3.1:405b_q2 model on your Mac with 64GB of RAM. Based on the requirements for this model, you would typically need more memory, specifically around 151GB for GPU and additional memory for the display GPU, which suggests that the 192GB version of the MacStudio would be more appropriate.
>
> I noticed you mentioned managing to run part of the model. I'm curious about how you've managed to only load certain layers of the model using Ollama on macOS. Specifically:
>
> How did you determine which layers to load and execute? Did you use any specific criteria or methodology to select these layers? Could you share more about the approach or tools you used to load only these specific layers? Understanding your strategy might help others.
>
> Looking forward to hearing more about your approach!

Hi, why does this sound like an LLM? Does Ollama now deploy an LLM bot to manage GitHub issues?

Anyway, Ollama does it automatically. It determines how many layers to offload and loads the model accordingly.
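For example, while the model is loaded you can run `ollama ps` to see how it split things up:

```shell
# While a model is loaded, the PROCESSOR column shows how much of it ended up on CPU vs. GPU
ollama ps
```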


@igorschlum commented on GitHub (Aug 11, 2024):

@chigkim I'm French and my English is not so good, so I use an LLM to help me understand issues and write answers.

I have a 192GB Mac Studio, but when I try `ollama run llama3.1:405b` I get an error because I don't have enough memory. But that was with Ollama 0.3.2 and the first version of llama3.1:405b.

I will try again.

Yesterday I pulled [405b-instruct-q3_K_S](https://ollama.com/library/llama3.1:405b-instruct-q3_K_S) and when I ran a prompt, it was using the full 180 GB of RAM that I allowed my Mac to use for the GPU.
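(For anyone who wants to raise that limit themselves: on recent macOS it is usually a sysctl along these lines, but the exact key name has changed between macOS versions, so treat this as a sketch rather than a recipe.)

```shell
# Allow up to ~180 GB of unified memory to be wired for the GPU (value is in MB).
# The key name varies by macOS version; older releases used a debug.* prefix.
sudo sysctl iogpu.wired_limit_mb=184320
```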

What tool do you use to see how many layers are used by a model?


@chigkim commented on GitHub (Aug 11, 2024):

Sorry, I thought you were a bot!
If you look at the log file and search for the last line containing the word "cmd", you'll see the exact command that Ollama used to launch the llama.cpp server.
Also, you can create a custom Modelfile and set `PARAMETER num_gpu` to specify how many layers to offload.
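For example, something like this (the log path assumes the default macOS install, and the model name and layer count are just placeholders):

```shell
# The last "cmd" line in the server log is the full llama.cpp launch command, including the GPU layer count
grep cmd ~/.ollama/logs/server.log | tail -n 1
```

```
# Modelfile: pin the number of offloaded layers instead of letting Ollama decide
FROM llama3.1:405b
PARAMETER num_gpu 47
```

Then `ollama create llama405-gpu47 -f Modelfile` and `ollama run llama405-gpu47` to use it.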


@igorschlum commented on GitHub (Aug 11, 2024):

I pulled Llama3.1:405b q2_K and q3_K_S on my Mac Studio. I could run both of them, but not the q3_K_M, where I got an error because there wasn't enough memory. If you don't load all the layers in memory, the model will swap a lot between cached and wired memory. I think that is the normal behavior; llama.cpp needs all the layers to process the prompt, as I understand it.

Reference: github-starred/ollama#3812