[GH-ISSUE #4006] wizardlm2:8x22b-q4_0 glitched on macOS (Apple Silicon) #28243

Closed
opened 2026-04-22 06:10:46 -05:00 by GiteaMirror · 9 comments

Originally created by @joliss on GitHub (Apr 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4006

Originally assigned to: @mxyng on GitHub.

What is the issue?

`wizardlm2:8x22b-q4_0` is producing gibberish on my M2 Max MacBook Pro with 96 GB RAM:

```
~ $ ollama run wizardlm2:8x22b-q4_0 'Hi!'

#

"▅$"
```
It works fine with `8x22b-q2_K` and `7b-q4_0`:

```
~ $ ollama run wizardlm2:8x22b-q2_K 'Hi!'
Hello! How can I assist you today? If you have any questions or need help with something, feel
free to ask. I'm here to help!
~ $ ollama run wizardlm2:7b-q4_0 'Hi!'
Hello there! How can I assist you today? If you have any questions or need help with something,
feel free to ask.
```

This is happening both on Ollama version 0.1.32 and on main.

It seems to be the same issue as #3668, which the reporter closed after a reinstall and restart resolved it for them, but neither rebuilding from source nor restarting the machine helped for me. I also tried re-pulling the model to force a SHA check.
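(For anyone reproducing this, a minimal sketch of the re-pull step, assuming the standard `ollama` CLI; removing the model first forces a full re-download and digest verification rather than a cached check.)

```
# Hedged sketch: force a clean re-download so blob digests are
# verified from scratch instead of reused from the local cache.
ollama rm wizardlm2:8x22b-q4_0
ollama pull wizardlm2:8x22b-q4_0
```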

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.1.32 & main

GiteaMirror added the bug label 2026-04-22 06:10:46 -05:00

@pdevine commented on GitHub (Apr 29, 2024):

I just tried this:

```
pdevine@MacBook-Pro-4 ollama % ./ollama run wizardlm2:8x22b-q4_0
>>> hi there
 Hello! How can I assist you today? If you have any questions or need information on a particular topic, feel free to ask. I'm here to help!

>>> Send a message (/? for help)
```

I seem to remember there might be an issue with the `~/.ollama/assets` directory having old `metal` files in it, but it's weird that the 2-bit quantized models work fine. cc @dhiltgen

Can you confirm the ID of the model with `ollama ls | grep wizardlm2:8x22b-q4_0`? I get:

```
% ./ollama ls | grep wizardlm2:8x22b-q4_0
wizardlm2:8x22b-q4_0                    abda6e58fd1d    79 GB   6 minutes ago
```

@joliss commented on GitHub (Apr 29, 2024):

```
$ find ~/.ollama/ | grep -i metal
$ ollama ls | grep wizardlm2:8x22b-q4_0
wizardlm2:8x22b-q4_0                    abda6e58fd1d    79 GB   27 hours ago
```

@pdevine Seems to be looking OK. What exact MacBook model are you on? Mine is a 16-inch MacBook Pro, M2 Max, 38-core GPU, 96 GB RAM, running Sonoma 14.4.1.
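(A minimal sketch for reporting the exact hardware and OS, using standard macOS tools; the grep pattern is only illustrative.)

```
# Print chip, model identifier, core count, and memory
system_profiler SPHardwareDataType | grep -E 'Model|Chip|Cores|Memory'
# Print the macOS version, e.g. 14.4.1
sw_vers
```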


@pdevine commented on GitHub (Apr 29, 2024):

@joliss I'm using an M3 w/ 128 GB of RAM. Sonoma 14.2.1.


@pdevine commented on GitHub (Apr 29, 2024):

Hey @joliss, I think some changes went in recently around Metal memory offloading, and I'm wondering if that is the issue here. These are some big models you're trying out; it's possible I'm not hitting it because I have more memory.


@joliss commented on GitHub (May 1, 2024):

@pdevine Issues with memory offloading seem like a plausible explanation. In the table in #4006 you can see the various model sizes I tested, and the problem occurs exactly with the largest models that still fit into my RAM, i.e. the ones that cause high memory pressure. The models that work fine are the smaller ones, as well as the ones that don't fit into RAM at all, where Ollama seems to stream them from disk rather than loading them into memory (judging by how the memory pressure never goes up).

Is there any chance you could try `mixtral:8x22b-instruct-v0.1-q5_1` (106 GB), `mixtral:8x22b-instruct-v0.1-q6_K` (116 GB), and `mixtral:8x22b-text-v0.1-q8_0` (149 GB) on your 128 GB machine? I'd be curious whether it reproduces at one of these sizes for you.
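(A rough sketch of that fit check with standard macOS tools; treat the comparison as a heuristic, since macOS reserves part of unified memory for the system and other processes.)

```
# Does the model plausibly fit in unified memory?
sysctl -n hw.memsize    # total physical RAM in bytes (96 GB = 103079215104)
ollama ls               # on-disk model sizes; the q4_0 8x22b is ~79 GB
# Watch paging while the model loads: sustained "Pageouts" growth
# suggests the weights are being evicted under memory pressure.
vm_stat 1
```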


@pdevine commented on GitHub (May 1, 2024):

Hey @joliss, I think there is a fix for Metal offloading which should be in `0.1.33`. Can you test out the prerelease (https://github.com/ollama/ollama/releases)? I think it should fix the problem.
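(A quick sanity check before re-testing, assuming the standard CLI flag, to make sure the prerelease build is the one actually running.)

```
# Should report the prerelease version, e.g. "ollama version is 0.1.33"
ollama -v
```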


@joliss commented on GitHub (May 1, 2024):

@pdevine Just tried the 0.1.33 prerelease, and it's still happening with both wizardlm2 and mixtral.


@dhiltgen commented on GitHub (May 2, 2024):

@joliss can you share the server log so we can see the memory calculations and how many layers we tried to load? From the sound of it, we may be overshooting, and the OS is then paging part of the model out under memory pressure.
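(On macOS the server log is written to `~/.ollama/logs/server.log` per Ollama's troubleshooting docs; a minimal sketch of pulling out the relevant lines, with the exact log strings being an assumption about the llama.cpp-style output.)

```
# Show recent memory/offload decisions from the server log
grep -iE 'offload|layers|memory' ~/.ollama/logs/server.log | tail -n 40
# Lines like "llm_load_tensors: offloaded 56/57 layers to GPU" show how
# much of the model was placed in Metal memory vs. left on the CPU.
```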


@mxyng commented on GitHub (May 2, 2024):

Related to #4028. I'm going to close this one as a duplicate, since #4028 covers mixtral:8x22b, of which wizardlm2:8x22b is a fine-tune.
