[GH-ISSUE #4033] incomprehensible answers from Gemma:7b #64540

Closed
opened 2026-05-03 18:01:03 -05:00 by GiteaMirror · 10 comments

Originally created by @kukidevs on GitHub (Apr 29, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4033

What is the issue?

![screenshot of garbled gemma:7b output](https://github.com/ollama/ollama/assets/113847173/34b3d3e0-70b2-4695-a86f-f824178e1b68)

Mistral:7b works fine, so I suppose the issue is related to the model.

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.1.32

GiteaMirror added the bug label 2026-05-03 18:01:04 -05:00

@pdevine commented on GitHub (Apr 29, 2024):

This should work fine. Here's my output on an MBP:

```
% ./ollama run gemma:7b
>>> hi there
Hi there, and welcome to my chat! 👋

I'm glad you decided to talk to me today. What would you like to talk about?
```

Two things:

  1. how much memory do you have, and is there memory pressure?
  2. can you paste the output for `ollama ls | grep gemma:7b`?

The latter should look like:

```
% ./ollama ls | grep gemma:7b
gemma:7b                                430ed3535049    5.2 GB  3 weeks ago
```
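
(Not from the original thread, but for readers who want to check question 1 on their own machine: macOS ships standard tools for both halves of it.)

```shell
# Total physical memory in bytes (hw.memsize is a standard macOS sysctl).
sysctl hw.memsize

# Print current VM statistics and memory-pressure levels; run this while
# the model is loaded to see whether the system is under pressure.
memory_pressure
```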

@thinkverse commented on GitHub (Apr 29, 2024):

> The latter should look like:
>
> ```
> % ./ollama ls | grep gemma:7b
> gemma:7b                                430ed3535049    5.2 GB  3 weeks ago
> ```

Running Ollama 0.1.32 on macOS Sonoma with 16 GB, I can vouch that the latest version of the model, [a72c7f4d0a15](https://ollama.com/library/gemma:7b), works without any modifications.

```shell
ollama ls | grep gemma:7b
gemma:7b                             	a72c7f4d0a15	5.0 GB	7 days ago
```

```shell
ollama run gemma:7b

>>> hi there
Hi there! 👋

It's great to hear from you. What would you like to chat about today? 😊
```

@kukidevs commented on GitHub (Apr 29, 2024):

@pdevine

> 1. how much memory do you have, and is there memory pressure?
> 2. can you paste the output for `ollama ls | grep gemma:7b`?

1. Mac Mini M2, 8 GB. Apparently it should be able to run, since mistral 7b can.
2. gemma:7b a72c7f4d0a15 5.0 GB 35 seconds ago

@pdevine commented on GitHub (Apr 29, 2024):

I realize *I* had the outdated version. :-D I think it's almost certainly a memory-pressure issue with Metal. cc @mxyng


@kukidevs commented on GitHub (Apr 29, 2024):

@pdevine
but can it be the problem here?
more:
image

It doesn't refuse to generate a reply, nor does it take long to generate; it just returns a bunch of "unused" tags, as if it's something that needs to be decoded.
Also, I might have misunderstood your response.


@pdevine commented on GitHub (Apr 29, 2024):

> @pdevine but can it be the problem here? [...]
>
> it doesn't refuse to generate a reply or takes long to generate, but it just returns a bunch of "unused" tags, as if it's something that needs to be decoded. also I might have misunderstood your response

I think maybe? It tends to do weird things when it runs out of memory. It's weird that mistral:7b works fine though.
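
(Not part of the original exchange, but relevant here: the server log, which the next comment quotes, is the easiest way to confirm a memory problem. Per Ollama's troubleshooting docs, on macOS it lives at `~/.ollama/logs/server.log`.)

```shell
# Follow the Ollama server log on macOS (default location per the docs).
tail -f ~/.ollama/logs/server.log
```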


@mkmohangb commented on GitHub (Apr 30, 2024):

I faced the same issue yesterday on my M1 Mac (8 GB) with Gemma 7B. The Gemma 2b model worked fine, and there were no issues with llama3 8B either.

I'm seeing this in the server logs for Gemma 7b; it looks like an out-of-memory issue (~6793 MiB allocated against a ~5461 MiB recommended max working set):

```
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   506.00 MiB, ( 6793.05 /  5461.34)
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model:      Metal compute buffer size =   506.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 931
llama_new_context_with_model: graph splits = 2
ggml_metal_graph_compute: command buffer 0 failed with status 5
```

Disabling the GPU works, but it is painfully slow. (I couldn't find this option in the docs folder; I referred to this [reddit thread](https://www.reddit.com/r/ollama/comments/1c0vw5w/how_can_i_run_ollama_in_cpumode/).)

```
ollama run gemma:7b
>>> what is 2 + 2?
<unused27><unused32><unused16><unused5><2mass><mask><unused5><unused3><unused12><unused18><unused21><unused29><<unused27><unused32><unused16><unused5><2mass><mask><unused5><unused3><unused12><unused18><unused21><unused29><unused6><unused20><unused7><unused2><unused30><unused2><unused18><unused20><unused7><unused2><unused12><unused7>nused6><unused20><unused7><unused2><unused30><unused2><unused18><unused20><unused7><unused2><unused12><unused7><unused10><unused32><unused0><unused21>

>>> /set parameter num_gpu 0
Set parameter 'num_gpu' to '0'
>>> what is 2 + 2?
**4**

2 + 2 = 4

>>>
```
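
(A side note for readers: the same workaround can be applied per request through Ollama's REST API, which accepts `num_gpu` in the `options` object. A minimal sketch, assuming the server is listening on its default port 11434:)

```shell
# Ask the local Ollama server to run gemma:7b entirely on the CPU by
# setting num_gpu (the number of layers offloaded to the GPU) to 0.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma:7b",
  "prompt": "what is 2 + 2?",
  "options": { "num_gpu": 0 }
}'
```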

@pdevine commented on GitHub (May 1, 2024):

@mkmohangb the team had been talking about this earlier today and we suspected that this might be the issue. Good catch.
There is a change in the 0.1.33 pre-release that may fix this. @kukidevs (or anyone commenting on the issue), would you be able to try it out?


@mkmohangb commented on GitHub (May 1, 2024):

@pdevine I tried Gemma with the 0.1.33-rc5 version. It works now, but it is slow. I see in the server logs that not all the layers are sent to the GPU. How do you decide the upper limit for this?


@pdevine commented on GitHub (May 1, 2024):

> @pdevine I tried Gemma with the 0.1.33-rc5 version. It works now, but it is slow. I see in the server logs that not all the layers are sent to the GPU. How do you decide the upper limit for this?
We compute the memory footprint of the model's graph and check whether you have enough memory.

I'll go ahead and close the issue. The workaround for now is to `/set parameter num_gpu` to a lower number; otherwise, just upgrade to 0.1.33.
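
(To make the memory math concrete with the log quoted earlier: it shows ~6793 MiB allocated against a ~5461 MiB recommended max working set on an 8 GB Mac, so some layers have to stay on the CPU. For anyone who wants the `num_gpu` workaround to persist across sessions, one option is to bake it into a Modelfile variant; `num_gpu` is a documented Modelfile parameter, though the model name and value below are illustrative, not from the thread.)

```shell
# Create a variant of gemma:7b with a fixed GPU layer count.
# "gemma-lowgpu" and the value 0 are made-up examples; pick a layer
# count that fits your machine (0 = CPU only).
cat > Modelfile.gemma-lowgpu <<'EOF'
FROM gemma:7b
PARAMETER num_gpu 0
EOF
ollama create gemma-lowgpu -f Modelfile.gemma-lowgpu
ollama run gemma-lowgpu
```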

Reference: github-starred/ollama#64540