[GH-ISSUE #14212] Can't run huggingface.co/unsloth/GLM-4.7-Flash-GGUF:Q3_K_S with thinking enabled. #9257

Closed
opened 2026-04-12 22:08:01 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @ghost on GitHub (Feb 11, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14212

What is the issue?

Am I missing something? GLM-4.7-Flash:latest (ollama's Q4 quant of some kind) supports thinking.
I wanted a quant that fits itself plus some context on a 16GB iGPU; the model is really quick in Q3 form, but lacks thinking.

Relevant log output

ollama run huggingface.co/unsloth/GLM-4.7-Flash-GGUF:Q3_K_S --verbose --think true
Error: 400 Bad Request: "huggingface.co/unsloth/GLM-4.7-Flash-GGUF:Q3_K_S" does not support thinking

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.15.4

GiteaMirror added the bug label 2026-04-12 22:08:01 -05:00
Author
Owner

@acaracappa commented on GitHub (Feb 12, 2026):

GLM-4.7-Flash doesn't support the --think flag because it lacks the deeper reasoning/thinking training found in the non-Flash GLM-4.7 models.

Additionally, Unsloth's documentation for this model explicitly advises against using their GGUF with Ollama, though for unrelated reasons:
https://unsloth.ai/docs/models/glm-4.7-flash

Author
Owner

@ghost commented on GitHub (Feb 12, 2026):

@acaracappa
Oh, I understand it now. So it has no think parameter, but there's extra metadata denoting the thinking and non-thinking zones of the model.
So it's de facto always thinking.

Ollama variant:
https://github.com/user-attachments/assets/3a12714f-2233-4861-b628-54dab8453576
unsloth variant:
https://github.com/user-attachments/assets/0407d8ea-b629-4661-a056-5b564c71f109

Author
Owner

@rick-github commented on GitHub (Feb 12, 2026):

Models pulled from HuggingFace frequently don't have the template/parser/renderer needed to make full use of the capabilities of the model in ollama. Fortunately it can be remedied:

$ echo FROM huggingface.co/unsloth/GLM-4.7-Flash-GGUF:Q3_K_S > Modelfile
$ echo PARSER glm-4.7 >> Modelfile
$ echo RENDERER glm-4.7 >> Modelfile
$ ollama create unsloth/glm-4.7-flash:q3_K_S
$ ollama run unsloth/glm-4.7-flash:q3_K_S
>>> /set parameter temperature 1.0
Set parameter 'temperature' to '1.0'
>>> /set parameter top_p 0.95     
Set parameter 'top_p' to '0.95'
>>> /set parameter repeat_penalty 1.0
Set parameter 'repeat_penalty' to '1.0'
>>> /set parameter num_ctx 8192      
Set parameter 'num_ctx' to '8192'
>>> Hey puter. Create a simple C11, SDL2, GLAD shader example. Say a bloom shader utilizing a classic 9-tap gaussian blur.
Thinking...
The user wants a simple C11 code example that demonstrates:
1.  **C11** language standard.
2.  **SDL2** for window and context creation.
3.  **GLAD** for OpenGL function loading.
4.  A **Bloom** shader effect.
5.  Specifically, the bloom needs to use a "classic 9-tap gaussian blur".
...
    Let's generate the response.
...done thinking.

Here is a complete, single-file C11 example. It sets up SDL2, creates an OpenGL context, and renders a checkerboard pattern with a bloom effect using a 9-tap Gaussian blur.
...

Note that the glm-4.7-flash models from HF have their architecture set to deepseek2 but with the glm4 tokenizer, which results in the model running on the llama.cpp engine rather than the native glm4moelite support of the ollama engine. Because it's been a while since a successful vendor sync of the llama.cpp engine, the glm-4.7-flash models from HF sometimes exhibit increased hallucinations and yapping in ollama.
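
The three `echo` redirections above can also be written as a single heredoc; this is just a convenience sketch of the same steps. The commented-out `ollama create -f` / `ollama show` calls assume a running ollama server; after creation, `ollama show` should list "thinking" under Capabilities.

```shell
# Write the Modelfile in one step instead of three echo redirections.
# Model, PARSER, and RENDERER names are the ones from this thread.
cat > Modelfile <<'EOF'
FROM huggingface.co/unsloth/GLM-4.7-Flash-GGUF:Q3_K_S
PARSER glm-4.7
RENDERER glm-4.7
EOF

# Then build and inspect (needs a running ollama server):
# ollama create unsloth/glm-4.7-flash:q3_K_S -f Modelfile
# ollama show unsloth/glm-4.7-flash:q3_K_S
```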

Author
Owner

@ghost commented on GitHub (Feb 12, 2026):

Appreciated.
For the future, is there any way to derive the PARSER and RENDERER from the architecture name? I guess they're logically named.

This is where all valid RENDERERs are defined:
https://github.com/ollama/ollama/blob/f8dc7c9f54a753d2c6d3410936e73486f9bf463d/model/renderers/renderer.go#L45
All valid PARSER names are at:
https://github.com/ollama/ollama/tree/f8dc7c9f54a753d2c6d3410936e73486f9bf463d/model/parsers

And I guess they aren't exactly all defined there, as there isn't any JSON defining the arch and the string returned is nil for most architectures:
https://github.com/ollama/ollama/blob/f8dc7c9f54a753d2c6d3410936e73486f9bf463d/x/create/client/create.go#L341 |
https://github.com/ollama/ollama/blob/f8dc7c9f54a753d2c6d3410936e73486f9bf463d/x/create/client/create.go#L436 |
https://github.com/ollama/ollama/blob/f8dc7c9f54a753d2c6d3410936e73486f9bf463d/x/create/client/create.go#L452 | unused?

I was using the search function to find it all.
To me it seems like it reads it out of the data structures in model/renderers/renderer.go:46, but I don't know where the mapping is defined; it doesn't seem to be in x/create/client/create.go.

Edit: I don't really write Go; I gave it a quick glance with GitHub search.
Edit2: So I guess if the answer is nil it will use "gptoss"/"gpt-oss" as renderer and parser if the architecture name is "gptoss"/"gpt-oss", etc.?

Reference: github-starred/ollama#9257