[GH-ISSUE #5547] Mixtral 8x22b inference output is empty or gibberish #49978

Open
opened 2026-04-28 13:37:34 -05:00 by GiteaMirror · 2 comments

Originally created by @PLK2 on GitHub (Jul 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5547

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Mixtral 8x22b instruct outputs are either empty or gibberish.
I have tried various quantizations (q4, q4_k_m, q5, etc.); all seem problematic.
Other models (e.g., llama3, command-r, Mistral) work fine.

Running 2x Nvidia 3090 GPUs (48 GB VRAM total), a 4.9 GHz AMD Ryzen 9 5950X, and 128 GB RAM.
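
For reference, a minimal reproduction sketch (the quantization tag name below is an assumption; substitute whichever tags were actually pulled):

```
# Pull one of the affected quantizations and run a short prompt
# (tag name is an assumption -- check `ollama list` or the library page)
ollama pull mixtral:8x22b-instruct-v0.1-q4_K_M
ollama run mixtral:8x22b-instruct-v0.1-q4_K_M "Say hello in one sentence."
```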

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.48

GiteaMirror added the memory, bug, nvidia labels 2026-04-28 13:37:34 -05:00

@PLK2 commented on GitHub (Jul 8, 2024):

I have also tested this on LM Studio and it works fine there.


@dhiltgen commented on GitHub (Jul 24, 2024):

I don't have an identical setup, but on a dual 3060 setup the model loads and works, albeit slowly (1.64 tokens/s).

My suspicion is that our memory predictions are slightly off and we're loading ~1 too many layers, leading to some sort of corruption. Can you share your server log so we can see the memory predictions and layer counts? You can also experiment with loading fewer layers (via the `num_gpu` option, as below) and see whether that gets it working properly.

```
% curl http://localhost:11434/api/generate -d '{
  "model": "mixtral:8x22b",
  "prompt": "hello",
  "stream": false,
  "options": {"num_gpu": 12}
}'
```

Please make sure to upgrade to the latest version too.
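
For a persistent variant of the same experiment, one option is to bake the reduced layer count into a derived model via a Modelfile. A minimal sketch, assuming the standard `FROM`/`PARAMETER` Modelfile syntax (`mixtral-12layers` is a hypothetical name):

```
# Create a derived model that pins the GPU layer count below the
# automatic estimate (model name is an illustrative assumption)
cat > Modelfile <<'EOF'
FROM mixtral:8x22b
PARAMETER num_gpu 12
EOF
ollama create mixtral-12layers -f Modelfile
ollama run mixtral-12layers "hello"
```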
