[GH-ISSUE #1826] MacOS: Ollama ignores changes to the iogpu.wired_limit_mb tunable when deciding whether to run on GPU or CPU #63078

Closed
opened 2026-05-03 11:39:57 -05:00 by GiteaMirror · 14 comments

Originally created by @easp on GitHub (Jan 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1826

macOS 14.2.1 on a 32GB M1 Max MBP

```
% ollama run dolphin-mixtral:8x7b-v2.7-q3_K_M
Error: model requires at least 48 GB of memory
```

This error appears immediately; it does not seem to try to load the model.

I tried pulling the model again. Same behavior. I've been running this model without issue on 0.1.17.

I tried upping the memory macOS makes available to the GPU, but it didn't help:

`sudo sysctl iogpu.wired_limit_mb=26624`
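
(The current value can be read back with `sysctl iogpu.wired_limit_mb`. Below is a minimal Go sketch of the same read, assuming the golang.org/x/sys/unix package; it shows roughly what a runtime would have to do to honor the override, and is not Ollama's actual code.)

```go
package main

import (
	"encoding/binary"
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// iogpu.wired_limit_mb caps how much unified memory the GPU may
	// wire on Apple Silicon; 0 means "use the OS default".
	raw, err := unix.SysctlRaw("iogpu.wired_limit_mb")
	if err != nil {
		fmt.Println("sysctl failed:", err)
		return
	}
	// Decode by length, since the value width can differ.
	var limitMB uint64
	switch len(raw) {
	case 4:
		limitMB = uint64(binary.LittleEndian.Uint32(raw))
	case 8:
		limitMB = binary.LittleEndian.Uint64(raw)
	}
	fmt.Printf("iogpu.wired_limit_mb = %d\n", limitMB)
}
```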

Also an issue with mixtral:8x7b-instruct-v0.1-q3_K_M. nous-hermes2:34b-yi-q3_K_M runs, as does nous-hermes2:34b.

According to the final `ggml_metal_add_buffer:` entry in the log, the memory requirements are:

- On 0.1.18, nous-hermes2:34b: 19675.33 MB (with 21845.34 MB available to the GPU)
- On 0.1.17, dolphin-mixtral:8x7b-v2.7-q3_K_M: 19964.30 MB
- On 0.1.17, mixtral:8x7b-instruct-v0.1-q3_K_M: 19965.17 MB

So, 0.1.18 runs a model that requires essentially as much memory as the q3_K_M mixtral variants it refuses to run.

Has the memory requirement for the mixtral models increased dramatically in 0.1.18, or is this new feature of estimating and enforcing memory requirements causing problems?


@ageorgios commented on GitHub (Jan 8, 2024):

+1

Would be nice to have an option to disable the check for power users.


@data-stepper commented on GitHub (Jan 8, 2024):

Or maybe we can just add a CLI argument that disables the check?
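
What such an opt-out could look like, as a sketch (illustrative only; `OLLAMA_SKIP_MEMCHECK` is an invented name, and no such option is confirmed to exist):

```go
package sketch

import (
	"fmt"
	"os"
)

// preflight rejects a model whose estimated requirement exceeds available
// memory, unless a hypothetical escape hatch is set.
func preflight(requiredBytes, availableBytes uint64) error {
	// OLLAMA_SKIP_MEMCHECK is an invented name for illustration.
	if os.Getenv("OLLAMA_SKIP_MEMCHECK") != "" {
		return nil // power user accepts the risk of swapping or OOM
	}
	if requiredBytes > availableBytes {
		return fmt.Errorf("model requires at least %d GB of memory", requiredBytes>>30)
	}
	return nil
}
```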


@SwagMuffinMcYoloPants commented on GitHub (Jan 8, 2024):

https://github.com/jmorganca/ollama/compare/v0.1.17...v0.1.18#diff-f4b356a7b15ee425318c5d670a1cd20a6f91441a484282a10e0cf1a68b1bd94aR54

case "47B": requiredMemory = 48 * format.GigaByte

Looks like they never had any RAM check for the 47B-parameter models; now it's just being enforced. I do agree that there should be some sort of ignore-the-check flag.
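
For context, the linked diff keys the requirement off the parameter-count label alone. A hedged Go paraphrase (only the 47B case is quoted above; the surrounding names are illustrative, not the exact source):

```go
package sketch

const gigaByte = 1 << 30

// requiredMemory paraphrases the v0.1.18-era check: the requirement is
// looked up by parameter-count label, so every "47B" (mixtral 8x7b)
// model is held to 48GB regardless of quantization, even though a
// q3_K_M 47B GGUF loads in roughly 20GB per the numbers above.
func requiredMemory(paramLabel string) int64 {
	switch paramLabel {
	case "47B":
		return 48 * gigaByte
	default:
		return 0 // other size labels elided; see the diff for the full list
	}
}
```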


@jimscard commented on GitHub (Jan 8, 2024):

The most expedient path would be to add an `--ignore-memcheck`-type flag, but the ultimate fix would probably be to change how the check is performed.

Currently, the check makes an assumption based on the parameter count, without considering quantization. So the problem we're having is that it assumes the mixtral:latest model is a 47B model that needs 48GB, when in reality, per https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/blob/main/README.md, the recommended Q4_K_M quant needs a maximum of only 28.94GB, which fits in 32GB of system memory.
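
A sketch of that quantization-aware direction (an illustration, not the fix that actually shipped): base the requirement on the GGUF file itself, which already reflects the quantization, plus context-dependent overhead.

```go
package sketch

import "os"

// estimateRequired sizes the requirement from the model file, which
// reflects the chosen quantization, plus an overhead term for the KV
// cache and compute buffers (both grow with context length).
func estimateRequired(modelPath string, ctxOverheadBytes int64) (int64, error) {
	fi, err := os.Stat(modelPath)
	if err != nil {
		return 0, err
	}
	return fi.Size() + ctxOverheadBytes, nil
}
```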


@demogit-code commented on GitHub (Jan 9, 2024):

Maybe unrelated, but in case it helps: after upgrading Ollama from 0.1.16 to 0.1.18, `ollama run llama2` and others fail with:

`Error: Post "http://127.0.0.1:11434/api/generate": EOF`

journalctl logs: [journalctl.part.txt](https://github.com/jmorganca/ollama/files/13874686/journalctl.part.txt)

Maybe related to #186?

The issue disappeared in version 0.1.19 for me.


@easp commented on GitHub (Jan 9, 2024):

0.1.19 helps. I can run the mixtral models again.

If I use a q4 quantization and/or a larger context size, it ends up silently failing over to CPU, even if I've used sysctl to tell the OS to make enough memory available to the GPU. The devs are aware of this issue and will address it in a later release.


@jukofyork commented on GitHub (Jan 9, 2024):

Testing the new VRAM allocation on the latest version pulled from GitHub:

Qwen-72b-chat q4_0 doesn't calculate the VRAM use properly; it just eats it all, then quits.

I'm also seeing deepseek-coder-33b q8_0 with a 16k context leave 4GB+ of VRAM unused (on a 24GB card). My attempts to increase it with `num_gpu` seem to get ignored too.

Using deepseek-coder-33b q8_0 with a 4k context seems to be OK, though.

As the OP suggested, I think there should still be an option to override the automatic calculation and let us manually set `num_gpu` if needed; see the sketch below.
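
What honoring a manual override might look like, sketched under the assumption that -1 marks "unset" (not the project's actual code):

```go
package sketch

// layersToOffload prefers an explicit user num_gpu value over the
// automatic estimate, instead of silently ignoring it.
func layersToOffload(autoEstimate, userNumGPU int) int {
	if userNumGPU >= 0 { // -1 = unset: let the estimator decide
		return userNumGPU
	}
	return autoEstimate
}
```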


@thony-p commented on GitHub (Jan 27, 2024):

> 0.1.19 helps. I can run the mixtral models again.
>
> If I use a q4 quantization and/or a larger context size, it ends up silently failing over to CPU, even if I've used sysctl to tell the OS to make enough memory available to the GPU. The devs are aware of this issue and will address it in a later release.

@easp I have the same issue. Do you have an issue number to follow the bug?


@easp commented on GitHub (Jan 28, 2024):

Current behavior (on v0.1.22) is that Ollama fails over to CPU inference when it estimates that GPU memory needs exceed what's available, ignoring the user's runtime change to the OS tunable (`iogpu.wired_limit_mb`).
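
In other words, the decision looks roughly like the sketch below (an illustration, not Ollama's code): the estimated footprint is compared against what the GPU is believed to be able to wire, and the fix is for the "available" side to respect a nonzero `iogpu.wired_limit_mb`.

```go
package sketch

// useGPU compares the estimated footprint against the wired limit the
// GPU is believed to have. Respecting a user-set iogpu.wired_limit_mb
// changes the "available" side of the comparison.
func useGPU(estimatedBytes, defaultLimitBytes, sysctlLimitMB uint64) bool {
	available := defaultLimitBytes
	if sysctlLimitMB > 0 { // 0 means the user left the OS default
		available = sysctlLimitMB << 20 // MB to bytes
	}
	return estimatedBytes <= available
}
```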


@gltanaka commented on GitHub (Jan 28, 2024):

I have an M3 Pro with 36GB of memory. I can run mixtral:8x7b-instruct-v0.1-q3_K_L (20GB) on the GPU, with 10GB of memory free while it runs, but if I go just one size up (4-bit, 26GB) it only runs on the CPU.
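
(A plausible reading of those numbers: the opening report shows the GPU getting roughly two-thirds of RAM by default, 21845.34 MB available on a 32GB machine, and two-thirds of 36GB is about 24GB, which falls between the 20GB model that stays on the GPU and the 26GB one that doesn't.)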

It would be amazing if this bug could be fixed. Many thanks for everyone's work on this.


@peanut256 commented on GitHub (Feb 4, 2024):

This should fix it: PR #2354


@thony-p commented on GitHub (Feb 7, 2024):

> This should fix it: PR #2354

Thank you. It is working! 😊 I can run Dolphin-Mixtral from GPU memory again!


@peanut256 commented on GitHub (Mar 6, 2024):

The issue should be solved with Ollama v0.1.28.


@jmorganca commented on GitHub (May 10, 2024):

This is fixed now! Thanks for the issue.

Reference: github-starred/ollama#63078