[GH-ISSUE #15235] can not run unsloth gemma-4 #56256

Open
opened 2026-04-29 10:29:26 -05:00 by GiteaMirror · 23 comments

Originally created by @iwater on GitHub (Apr 2, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15235

What is the issue?

v0.20 returns a 500 Internal Server Error

ollama run hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
Error: 500 Internal Server Error: unable to load model: /Users/iwater/.ollama/models/blobs/sha256-8520b6bc9bbd9b6432bc1c37ee460eb861b4d821ddd6f56ef3a7ebcfca7e6005

Relevant log output


OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.20.0-rc1

GiteaMirror added the bug label 2026-04-29 10:29:26 -05:00

@rick-github commented on GitHub (Apr 3, 2026):

Unsloth versions of gemma4 will not work until the next vendor sync. The unsloth models (and those from any other source that uses llama.cpp to quantize the model) are split models, with separate text and vision GGUFs. Split models must run on the llama.cpp backend, which does not yet support the gemma4 architecture. The next vendor sync will merge llama.cpp support for gemma4.

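For anyone who wants to confirm whether a pulled GGUF is one of these split models, here is a minimal sketch (not from the thread; it assumes the `ollama` Python package and that the tag has already been pulled) that simply counts the FROM lines in the generated Modelfile:

```python
# Minimal sketch: count FROM lines in the generated Modelfile to see whether a
# pulled GGUF is a split text+vision model. Assumes the `ollama` Python
# package and that the model tag has already been pulled.
import ollama

MODEL = "hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M"  # tag from this issue

info = ollama.show(MODEL)
from_lines = [
    line for line in info.modelfile.splitlines()
    if line.strip().startswith("FROM ")
]

print(f"{len(from_lines)} FROM line(s):")
for line in from_lines:
    print("  " + line)

# Two or more FROM lines means separate text and vision GGUFs, i.e. a split
# model that currently depends on the llama.cpp backend described above.
```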

@yie2984 commented on GitHub (Apr 3, 2026):

me too


@francisoliverlee commented on GitHub (Apr 3, 2026):

me too
Error: 500 Internal Server Error: unable to load model: /root/.ollama/models/blobs/sha256-27ee8bbb31f274a8e6e37217198bce873611c8c605566d2c2ec95d064c9a213b


@Cyberschorsch commented on GitHub (Apr 4, 2026):

Meanwhile you can try and "remove" the vision model part:

Run this command to create a Modelfile:
ollama show --modelfile hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M > gemma_4_unsloth_modelfile

Then open the gemma_4_unsloth_modelfile and add a # character in front of the second FROM line:

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M

FROM /root/.ollama/models/blobs/sha256-2f8672b0c2cca8dedfb8782815c2769ccdaa6512788f3ee87b32cf117f0dffc1
#FROM /root/.ollama/models/blobs/sha256-fc2ebf4c44528daa2cea7b39891712847ca5e4f87dcf578054a06c46bfe6da27
TEMPLATE "{{ if .System }}<bos><|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
"
PARAMETER stop <bos>
PARAMETER stop <|turn>
PARAMETER stop <turn|>
PARAMETER stop <|turn>user

Then run:

ollama create gemma-4-unsloth -f gemma_4_unsloth_modelfile

Now you can run the unsloth version without vision:

ollama run gemma-4-unsloth

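If you need to repeat this workaround for several tags, the FROM-line commenting can be scripted. The following is only a rough sketch of the same steps (assuming the ollama CLI is on PATH; the target name `gemma-4-unsloth` is just the example used above):

```python
# Sketch automating the manual workaround above: dump the Modelfile, comment
# out every FROM line after the first (the vision/mmproj blob), and create a
# new text-only model. Assumes the ollama CLI is installed and on PATH.
import subprocess

SRC = "hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M"
DST = "gemma-4-unsloth"

modelfile = subprocess.run(
    ["ollama", "show", "--modelfile", SRC],
    capture_output=True, text=True, check=True,
).stdout

seen = 0
lines = []
for line in modelfile.splitlines():
    if line.startswith("FROM "):
        seen += 1
        if seen > 1:
            line = "#" + line  # keep only the first (text) blob
    lines.append(line)

with open("gemma_4_unsloth_modelfile", "w") as f:
    f.write("\n".join(lines) + "\n")

subprocess.run(["ollama", "create", DST, "-f", "gemma_4_unsloth_modelfile"], check=True)
print(f"created {DST}; try: ollama run {DST}")
```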

@uprootnetworks commented on GitHub (Apr 4, 2026):

> Meanwhile you can try and "remove" the vision model part:
>
> Run this command to create a Modelfile: ollama show --modelfile hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M > gemma_4_unsloth_modelfile
>
> Then open the gemma_4_unsloth_modelfile and add a # character in front of the second FROM line:
>
> # Modelfile generated by "ollama show"
> # To build a new Modelfile based on this, replace FROM with:
> # FROM hf.co/unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
>
> FROM /root/.ollama/models/blobs/sha256-2f8672b0c2cca8dedfb8782815c2769ccdaa6512788f3ee87b32cf117f0dffc1
> #FROM /root/.ollama/models/blobs/sha256-fc2ebf4c44528daa2cea7b39891712847ca5e4f87dcf578054a06c46bfe6da27
> TEMPLATE "{{ if .System }}<bos><|turn>system
> {{ .System }}<turn|>
> {{ end }}{{ if .Prompt }}<|turn>user
> {{ .Prompt }}<turn|>
> {{ end }}<|turn>model
> {{ .Response }}<turn|>
> "
> PARAMETER stop <bos>
> PARAMETER stop <|turn>
> PARAMETER stop <turn|>
> PARAMETER stop <|turn>user
>
> Then run:
>
> ollama create gemma-4-unsloth -f gemma_4_unsloth_modelfile
>
> Now you can run the unsloth version without vision:
>
> ollama run gemma-4-unsloth

This worked for me, thank you!


@aznohonza commented on GitHub (Apr 7, 2026):

Commenting out the 2nd FROM line works, but what if I want to use the vision capabilities? I tried updating to the current latest 0.20.3 and keeping both FROM lines uncommented, but it still fails to load the model.

honza@desktop:~/Documents$ ollama run gemma-4-E4B-it-UD-Q4_K_XL what is this image? ./Pictures/Screenshots/s.png
Error: 500 Internal Server Error: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-e8cd1a4c5162dfa355c567b4099206793793b5548951f077f376a0076dc0a058


@rick-github commented on GitHub (Apr 7, 2026):

https://github.com/ollama/ollama/issues/15235#issuecomment-4181221561


@aznohonza commented on GitHub (Apr 7, 2026):

Thanks, so I need to wait for llama.cpp to support it, and then for ollama to update to the new llama.cpp version?
If so, are there any other quantized versions of gemma4 that currently work with ollama and support vision?


@Cyberschorsch commented on GitHub (Apr 7, 2026):

llama.cpp should already support it. You could use LocalAI in the meantime with an updated llama.cpp backend which already supports gemma4 models with vision.


@aznohonza commented on GitHub (Apr 7, 2026):

Ah alright, thanks. Can I install LocalAI without breaking ollama on my system?


@rick-github commented on GitHub (Apr 7, 2026):

Or just use the gemma4 model from the ollama library.


@aznohonza commented on GitHub (Apr 7, 2026):

But that isn't quantized, or is it?


@rick-github commented on GitHub (Apr 7, 2026):

The ollama library has [q4](https://ollama.com/library/gemma4:e4b-it-q4_K_M), [q8](https://ollama.com/library/gemma4:e4b-it-q8_0) and [bf16](https://ollama.com/library/gemma4:e4b-it-bf16) quants of the model. Since you are trying to run the q4 quant from unsloth, you might as well use the q4 from ollama. There will be slight differences in some tensors but it's unlikely to make a significant difference to the output. If you want to use a different quant, for example q6 because you want to trade space for precision without going as large as q8, then you need to wait for the llama.cpp sync.


@aznohonza commented on GitHub (Apr 7, 2026):

I downloaded the ollama q4, but unfortunately it doesn't have the mmproj included, so it does not support vision. Guess I'll just wait.

ollama show --modelfile gemma4:e4b-it-q4_K_M

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM gemma4:e4b-it-q4_K_M

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4c27e0f5b5adf02ac956c7322bd2ee7636fe3f45a8512c9aba5385242cb6e09a
TEMPLATE {{ .Prompt }}
RENDERER gemma4
PARSER gemma4
PARAMETER top_p 0.95
PARAMETER temperature 1
PARAMETER top_k 64

@rick-github commented on GitHub (Apr 7, 2026):

Ollama models combine the vision (mmproj) and text weights into a single GGUF file, which is why only one FROM line appears in the Modelfile. gemma4:e4b-it-q4_K_M will process images.

$ ollama show gemma4:e4b-it-q4_K_M
...
  Capabilities
    completion    
    vision        
    audio         
    tools         
    thinking      
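The same check can be done from the API. Below is a small sketch (assuming the `ollama` Python package; recent servers report a `capabilities` list from `/api/show`, older ones may not, so the code tolerates its absence):

```python
# Sketch: ask the server which capabilities a model reports. If "vision" is
# listed, the single-GGUF build has the mmproj weights merged in and should
# accept images. Assumes the `ollama` Python package.
import ollama

MODEL = "gemma4:e4b-it-q4_K_M"

info = ollama.show(MODEL)
caps = list(getattr(info, "capabilities", None) or [])

print(f"{MODEL} capabilities: {', '.join(caps) if caps else 'not reported'}")
if "vision" in caps:
    print("Vision is available: images can be passed in chat requests.")
```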

@aznohonza commented on GitHub (Apr 7, 2026):

Oh alright, thanks. And could you please share how to supply the model with an image? My old method that worked for gemma3 does not seem to work for this gemma4 model.

 honza@desktop:~/Documents$ cat a.py 
import ollama

response = ollama.chat(
    #model="gemma-4-E4B-it-UD-Q4_K_XL",
    #model="gemma-3-12b-it-qat-UD-Q4_K_XL",
    model="gemma4:e4b-it-q4_K_M",
    messages=[
        {"role": "user", "content": "Describe the image", "images": ["s.png"]}
    ],
)

print(response["message"]['role'])
print(response["message"]['content'])
honza@desktop:~/Documents$ python3 a.py 
assistant
I cannot describe an image because no image was provided.

You have pasted a large block of academic or literary notes, but there is no visual content attached.

If you would like me to help you with the text you provided (e.g., summarizing the literary movements, explaining the life of a specific author, or formatting the notes), please let me know!

@rick-github commented on GitHub (Apr 7, 2026):

Works for me.

$ cp image1.png s.png
$ python3 a.py 
assistant
This is a charming, brightly lit portrait of an extremely fluffy, white puppy.

Is s.png an image?


@aznohonza commented on GitHub (Apr 7, 2026):

Yes, s.png is an image. I don't know why it doesn't work for me, but if I pass the image as bytes instead it works fine. (I tried a different image than s.png, though s.png also worked with this new script.) I have the latest ollama 0.20.3.

$ cat b.py 
import ollama
from pathlib import Path

# Path to your image
#image_path = "/home/honza/Pictures/Screenshots/s.png"
image_path = "/home/honza/Pictures/wallpapers/3.jpg"

# Read the image as raw bytes (safest approach for multimodal models)
img_bytes = Path(image_path).read_bytes()

response = ollama.chat(
    model="gemma4:e4b-it-q4_K_M",
    messages=[
        {
            "role": "user",
            "content": "Describe the image, who is the main subject?",
            "images": [img_bytes] # Pass the raw bytes instead of the path string
        }
    ],
)

# Use dot notation to access the properties!
print(response.message.role)
print(response.message.content)
$ python3 b.py 
assistant
**Description of the Image:**

This is a highly dramatic, stylized, and cinematic image dominated by a futuristic vehicle. The overall aesthetic screams **Cyberpunk**—a blend of high-tech extravagance and gritty, dystopian urban decay.

**The Main Subject:**
The main subject is a vividly customized, low-slung **sports car**.

**Detailed Description:**

1.  **The Car:** The car is painted a striking, vibrant **mustard yellow or gold**. It is heavily modified with futuristic, aggressive styling. Its most notable features are the mechanical details and the lighting:
    *   **Wheels:** The wheels are massive and highly detailed, featuring glowing **electric blue** accents that give them a mechanical, cybernetic look.
    *   **Body:** The body is angular and sleek, featuring visible glowing trim and graphic decals, including the word "CYBERPUNK" on the side.
    *   **Front:** The front fascia is aggressive, featuring glowing marker slots marked "XXX."
    *   **Impact:** The car appears extremely powerful, almost mechanical in its excess of glowing parts and visible technology.

2.  **The Setting:** The car is parked or driving slowly on a dark, slick, reflective asphalt street. The background is a massive, overwhelming **futuristic cityscape**. The buildings are towering and indistinct, covered in layers of glowing neon advertisements, digital displays, and multi-colored lights (pinks, blues, and purples). The setting feels dense, rainy, or perpetually damp, contributing to the glossy reflection on the ground.

3.  **Mood and Lighting:** The lighting is dark and moody, relying heavily on artificial, saturated light sources. The glowing blue elements of the car are particularly prominent, contrasting sharply with the yellow paint and the deep, mysterious tones of the background. The overall atmosphere is intensely futuristic, dramatic, and restless.

**In summary, the image captures a highly stylized, neon-drenched cyber-race car in a sprawling, dystopian metropolis.**
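For reference, here is a sketch of the two ways the Python client can be handed an image, reusing names from this thread (it assumes ollama-python 0.4+; an absolute path avoids any question of what a relative name like `s.png` resolves to from the script's working directory, and raw bytes work as shown in `b.py` above):

```python
# Sketch: pass the image either as an absolute path or as raw bytes.
# The path below is only an example taken from this thread.
from pathlib import Path
import ollama

image_path = Path("/home/honza/Pictures/Screenshots/s.png")

for label, image in (("path", image_path), ("bytes", image_path.read_bytes())):
    response = ollama.chat(
        model="gemma4:e4b-it-q4_K_M",
        messages=[{"role": "user", "content": "Describe the image", "images": [image]}],
    )
    print(label, "->", response.message.content[:80])
```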

@rparo20 commented on GitHub (Apr 8, 2026):

We ran into the same issue with unsloth Gemma 4 GGUFs on Ollama 0.20.

After some debugging, we found the root cause is the split model format + architecture mismatch in Ollama's llama.cpp backend. So we quantized directly from Google's official weights using latest llama.cpp (build 400ac8e, includes the Gemma 4 BOS fix).

Everything loads and runs correctly on Ollama 0.20.2. Tested on real hardware:

| Model | Size | VRAM | Mac mini M4 (16GB) | M4 Max (128GB) | Tool Calling |
|-------|------|------|--------------------|----------------|--------------|
| batiai/gemma4-e2b:q4 | 3.2GB | 7.1GB | 107.8 t/s | 132.5 t/s | inconsistent |
| batiai/gemma4-e4b:q4 | 5.0GB | 10GB | 57.1 t/s | 84.0 t/s | works |
| batiai/gemma4-26b:q3 | 13GB | 20GB | needs 24GB+ | 70.7 t/s | works |
| batiai/gemma4-26b:q4 | 16GB | 23GB | needs 32GB+ | 74.9 t/s | works |

ollama pull batiai/gemma4-e4b:q4
ollama pull batiai/gemma4-26b:q3

Source weights (BF16) -> GGUF conversion -> Q4_K_M / Q6_K quantization. No re-quantization from existing GGUFs.
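For context, that pipeline corresponds roughly to the following sketch (not their actual build script; it assumes a local llama.cpp checkout providing `convert_hf_to_gguf.py` and the `llama-quantize` binary, and all paths are examples):

```python
# Sketch of the described pipeline: official BF16 weights -> BF16 GGUF ->
# Q4_K_M quant, using llama.cpp's converter and quantizer. Paths are examples.
import subprocess

HF_DIR = "./gemma-4-E4B-it"           # local snapshot of the source weights
LLAMA_CPP = "/opt/llama.cpp"          # assumed checkout location
BF16_GGUF = "gemma-4-E4B-it-bf16.gguf"
Q4_GGUF = "gemma-4-E4B-it-Q4_K_M.gguf"

# 1. HF safetensors -> BF16 GGUF
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", HF_DIR,
     "--outfile", BF16_GGUF, "--outtype", "bf16"],
    check=True,
)

# 2. BF16 GGUF -> Q4_K_M quant
subprocess.run(
    [f"{LLAMA_CPP}/build/bin/llama-quantize", BF16_GGUF, Q4_GGUF, "Q4_K_M"],
    check=True,
)
```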

HuggingFace repos with full benchmark data:

- https://huggingface.co/batiai/gemma-4-26B-A4B-it-GGUF
- https://huggingface.co/batiai/gemma-4-E4B-it-GGUF
- https://huggingface.co/batiai/gemma-4-E2B-it-GGUF

Hope this helps anyone stuck on this.


@rick-github commented on GitHub (Apr 8, 2026):

You realize that these models are not multi-modal, right?


@rparo20 commented on GitHub (Apr 11, 2026):

Yes, that's intentional — our GGUFs are text-only by design.

We originally included mmproj (vision) in the Modelfile, but ran into the same issues reported in [ollama#15352](https://github.com/ollama/ollama/issues/15352) and [llama.cpp#21402](https://github.com/ggml-org/llama.cpp/issues/21402): adding the vision projector caused the model to fail to load entirely on Ollama 0.20.x.

So we ship text-only builds that reliably load and run on Ollama 0.20+. They're also ~1GB smaller per tag since mmproj is excluded. For users who need vision later, we've uploaded the mmproj-BF16.gguf files separately to HuggingFace — they can be added back via Modelfile once the ecosystem catches up:

- https://huggingface.co/batiai/gemma-4-E2B-it-GGUF
- https://huggingface.co/batiai/gemma-4-E4B-it-GGUF
- https://huggingface.co/batiai/gemma-4-26B-A4B-it-GGUF
- https://huggingface.co/batiai/gemma-4-31B-it-GGUF

The goal was "text chat + tool calling that actually works on latest Ollama today," not full multimodal. Apologies if the framing was unclear in the original comment.

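As a hedged illustration of what "adding it back via Modelfile" could look like later (assuming `huggingface_hub` and the ollama CLI; the text-GGUF filename below is a placeholder, and per the earlier comments this will not load until a vendor sync adds gemma4 support to the llama.cpp backend):

```python
# Sketch: fetch the text GGUF plus the separately published mmproj and write a
# two-FROM Modelfile, mirroring the split-model layout seen earlier in this
# thread. The text GGUF filename is a placeholder, not confirmed by the repo.
import subprocess
from huggingface_hub import hf_hub_download

REPO = "batiai/gemma-4-E4B-it-GGUF"
text_gguf = hf_hub_download(REPO, "gemma-4-E4B-it-Q4_K_M.gguf")  # placeholder filename
mmproj_gguf = hf_hub_download(REPO, "mmproj-BF16.gguf")

with open("Modelfile.gemma4-vision", "w") as f:
    f.write(f"FROM {text_gguf}\nFROM {mmproj_gguf}\n")

# Expected to fail on current releases for the reasons discussed above.
subprocess.run(
    ["ollama", "create", "gemma4-e4b-vision", "-f", "Modelfile.gemma4-vision"],
    check=True,
)
```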

@TByte007 commented on GitHub (Apr 12, 2026):

> Yes, that's intentional — our GGUFs are text-only by design.
>
> We originally included mmproj (vision) in the Modelfile, but ran into the same issues reported in ollama#15352 and llama.cpp#21402: adding the vision projector caused the model to fail to load entirely on Ollama 0.20.x.
>
> So we ship text-only builds that reliably load and run on Ollama 0.20+. They're also ~1GB smaller per tag since mmproj is excluded. For users who need vision later, we've uploaded the mmproj-BF16.gguf files separately to HuggingFace — they can be added back via Modelfile once the ecosystem catches up:
>
> * https://huggingface.co/batiai/gemma-4-E2B-it-GGUF
> * https://huggingface.co/batiai/gemma-4-E4B-it-GGUF
> * https://huggingface.co/batiai/gemma-4-26B-A4B-it-GGUF
> * https://huggingface.co/batiai/gemma-4-31B-it-GGUF
>
> The goal was "text chat + tool calling that actually works on latest Ollama today," not full multimodal. Apologies if the framing was unclear in the original comment.

Thank you for those models, they work perfectly and are fast. I can even fit 40k context on a 3090 :)


@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15235
Analyzed: 2026-04-18T18:22:52.244357

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.


Reference: github-starred/ollama#56256