[GH-ISSUE #7402] ollama run aya-expanse:32b gives nonsensical output #51219

Closed
opened 2026-04-28 18:56:26 -05:00 by GiteaMirror · 4 comments

Originally created by @lefromage on GitHub (Oct 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7402

What is the issue?

aya-expanse 8b runs fine, but 32b produces nonsensical output, as shown below.

```
$ ollama run aya-expanse:8b
>>> hi
Hello! How can I help you today?

$ ollama run aya-expanse:32b
>>> hi
LKJOLE6IEGU;F<B9DN:FM4VNOUSV7I=5<UNHBGUQTUR=GOG;LRN<CLE<;7BV@>T:8ND5>>;<34LR;C;D7M6<QO5UIOI7BBUG9:?<D6UM<SO:MVKA6OAUFIK67UD@LKJOLE6IEGU;F<B9DN:FM4VNOUSV7I=5<UNHBGUQTUR=GOG;LRN<CLE<;7BV@>T:8ND5>>;<34LR;C;D7M6<QO5UIOI7BBUG9:?<D6UM<SO:MVKA6OAUFIK67UD@M67JHD?N4:J? .....
```

..... and it keeps producing nonsensical output.

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.3.14

GiteaMirror added the bug label 2026-04-28 18:56:26 -05:00

@lefromage commented on GitHub (Oct 28, 2024):

Same problem as above; tested on M1 and M2 macOS Sequoia laptops.


@rick-github commented on GitHub (Oct 28, 2024):

Might be a macOS issue. Does it work better if you type `/set parameter num_gpu 0` before you type `hi`?

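The same workaround can be applied outside the interactive REPL through Ollama's HTTP API, which accepts a `num_gpu` value in the request's `options`. A minimal sketch that builds such a request body (it only constructs the JSON; sending it to a running server at `http://localhost:11434/api/generate` is left to the caller):

```python
import json

def cpu_only_request(model: str, prompt: str) -> str:
    """Build an /api/generate request body that disables GPU offload.

    num_gpu controls how many model layers are offloaded to the GPU;
    0 keeps the whole model on the CPU, matching /set parameter num_gpu 0.
    """
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": 0},  # CPU-only workaround
    }
    return json.dumps(body)

payload = cpu_only_request("aya-expanse:32b", "hi")
```

You could then send it with, for example, `curl http://localhost:11434/api/generate -d "$payload"`.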

@lefromage commented on GitHub (Oct 28, 2024):

It does work around the issue with `/set parameter num_gpu 0`, but that is slow, which negates the speed advantage of the MPS GPU on Macs, which Ollama can usually use.

community-mlx has it working for 8-bit (40 GB), which I believe uses the MPS GPU. I am looking forward to using the smaller 4-bit (20 GB) on a 32 GB RAM Mac with the GPU.


@jessatech commented on GitHub (Mar 22, 2025):

This is happening for me with all models pulled from Ollama as well, on my M3 Max on Sequoia 15.2, so it is clearly some kind of issue offloading to the GPU. I see this sometimes with LM Studio too, and have found that reducing num_gpu to one less than the maximum fixes it; I am not sure how to determine the maximum num_gpu for a model, though.

Edit: After some poking around, I found that the logs show the repeating layers offloaded to the GPU, so you can count the maximum layers there. Setting num_gpu to the total number of repeating layers fixes the issue for me. There is an additional output layer offloaded to the GPU after the repeating layers; when it is counted in num_gpu, I get garbage output.

Filed issue #9948

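The layer-counting step above can be sketched as a small log scraper. This assumes llama.cpp-style load lines in the Ollama server log, e.g. `llm_load_tensors: offloading 40 repeating layers to GPU`; the exact wording may vary between versions, so the regex is an assumption:

```python
import re
from typing import Optional

# Assumed log line format; adjust if your Ollama version logs differently.
LAYER_RE = re.compile(r"offloading (\d+) repeating layers to GPU")

def repeating_layers(log_text: str) -> Optional[int]:
    """Return the repeating-layer count found in the log, or None if absent."""
    m = LAYER_RE.search(log_text)
    return int(m.group(1)) if m else None

sample = "llm_load_tensors: offloading 40 repeating layers to GPU"
```

Per the workaround described in the comment, you would then set `num_gpu` to that count, so that the extra output layer stays off the GPU.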
Reference: github-starred/ollama#51219