[GH-ISSUE #14160] GLM-4.7-Flash from Hugging Face does not work well #55746

Open
opened 2026-04-29 09:40:59 -05:00 by GiteaMirror · 10 comments

Originally created by @itzpingcat on GitHub (Feb 8, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14160

What is the issue?

When running any quantization of a GLM-4.7-Flash model from Hugging Face, the output is incoherent: the model rambles, contradicts itself, and never settles on an answer.
The example model I used is a derestricted version.
Modelfile used:

FROM hf.co/mradermacher/GLM-4.7-Flash-Derestricted-i1-GGUF:IQ3_S
TEMPLATE {{ .Prompt }}
RENDERER glm-4.7
PARSER glm-4.7
PARAMETER temperature 1
PARAMETER top_p 0.95

The issue is not the model itself: I can run the same model with llama.cpp and it works perfectly fine.
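To make the comparison repeatable, the same prompt can be sent to a local Ollama server with the sampling parameters from the Modelfile above. This is a minimal sketch, assuming the default local endpoint `http://localhost:11434/api/generate` and a hypothetical model name `glm-4.7-flash-hf` created from that Modelfile. The `num_predict` cap of 512 tokens is an added assumption to stop a runaway generation and is not part of the original report.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build a non-streaming generate request using the Modelfile's sampling settings."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 1,           # from the Modelfile
            "top_p": 0.95,              # from the Modelfile
            "num_predict": max_tokens,  # assumption: cap output so a spiral terminates
        },
    }


def generate(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 600.0) -> dict:
    """POST the request and return the decoded JSON response."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `generate("glm-4.7-flash-hf", "what is the capital of america")` (hypothetical model name) returns the server's JSON reply, which can then be compared with the llama.cpp output for the same prompt.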

Relevant log output

llama.cpp output:

> what is the capital of america

[Start thinking]
1.  **Identify the user's question:** The user is asking for the capital of the United States of America.
2.  **Retrieve the capital city:** I need to access my knowledge base regarding the US capital.
    *   It is "Washington, D.C.".
    *   It is often referred to simply as "Washington".
    *   Full name: District of Columbia.
3.  **Formulate the answer:**
    *   Direct answer: Washington, D.C.
    *   Context (optional but helpful): Mention it's also known as D.C. or just Washington.
4.  **Review/Refine:**
    *   Is it simple? Yes.
    *   Is it accurate? Yes.
    *   Is the capital *only* one word? No, "Washington, D.C." or "Washington D.C." are the correct terms.
    *   *Self-Correction during thought process:* Some might say "Washington". While common usage allows "Washington" to refer to D.C., strictly speaking, the "capital of America" is Washington, D.C. It's safer to use the full form or the abbreviation D.C.
5.  **Final Output Generation:** "Washington, D.C." (or "Washington D.C.") is the best answer. I will provide the full name for clarity.
[End thinking]

The capital of the United States is **Washington, D.C.** (District of Columbia). It is commonly referred to simply as Washington.

Ollama output:

>>> what is the capital of america
Thinking...
1.  **Analyze the Request:** The user wants to know what constitutes "America's" (specifically, within North
America) *capital city*.

2.  **Identify Key Concept:**
    *   Location A = USA.
        a.k.a United States
            Washington D.C is often informally shortened or confused with US capital for DC area due to name
alone; however historically Philadelphia was temporary site (de facto). Also, there have been many "nationwide"
governments in the Western Hemisphere that called themselves America and had capitals: Quito*? No...
    *   Location B = USA.
        Wait. Let's be super pedantic.

3.  **Historical Context for 'America' as a proper noun (The Continental aspect):**
       The most common meaning of "American" in the context is usually US-centric, but not always historically
consistent).

4.  **Determine Capital: Washington D.C.
        *   Renamed by Spanish from Quito? No.

5.  **Distill to Simple Answer:** To get it right every time without confusion:
    If I ask "What's your name?", and you say Smith, do not be offended if we are at a formal event requiring full
titles (Dr./Prof).

        *   DC is the current capital of US.

6.  **Drafting Response:**
**Option A:** The user made up their mind to start there after all that thinking.

7.

---

### Final Output Generation:
(Draft appears sound.)
*Self-Correction during output:* None, really clear now.


I will format this as a clean Q&A response.
(Thinking process looks solid enough).

Wait... what if they meant **Canada**?
No one uses "America" for Canada.

What about the US' capital? DC. (It is an island of Washington state).
So even though many people pronounce it 'Washing-TON', I can't say that with 100% confidence without risking being
wrong according to locals, or if they meant Wasingt*ton*d... no wait.
**Q3: What about "America" as in the *name?***

I'll leave a small note saying "(District of Columbia)" at least once.

Also consider **North America**: there is also Buenos Aires (Argentina) for South American continent...
The prompt just says 'capital OF AMERICA'. Usually implies USA or Americas broadly. I will provide DC but
acknowledge it's likely what they mean.
No, keep the answer simple first? No one has 50 years to figure out my mental model.

**Simplicity is key**: "What do you think?" -> "...Is this a trap question for some specific other thing that
isn't 'capital of america' in common usage?"

*Self-Correction on Output Structure:* I'll make the main answer DC, and add details (country context) to keep it
professional.
(There's no perfect name because US is also called American).

**Verdict:**
The capital city for **USA**, usually referred as 'America', has been a moving target based in Quito. Wait.

Final Plan:
1. State "Washington D.C." or simply "**District of Columbia... oh wait, yes).
(Think silently if needed.)

Q2 (Optional context): Why is this so confusing? Because the capital for America = USA.
Okay I will go with **USA** answer primarily.


*Self-Correction:* What happens after 1800?

Final check: Did anyone else's mind just get derailed by "What about Quito...?"
No.

I'm confident in DC. (It is also an 'America'? No).
So, capital of America = Washington D.C.
Capital *City*? Yes.


Okay I will provide the answer clearly stating **Washington,Digital_Revolution** and 1788?
2: Federal City renamed?

Let's do this without overthinking.

*(Self-Correction for final output formatting)*:
I'll format as a clean response. It reduces cognitive load.
(Drafting in head)
Capital of America is usually understood to mean "The capital city that used the name 'America'".

Okay, ready.


**What?**
"The best place I can think?" (Wait no).
... (283 lines left) (spirals infinitely)

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.15.6

GiteaMirror added the bug label 2026-04-29 09:40:59 -05:00

@rick-github commented on GitHub (Feb 8, 2026):

> Example model I use was derestricted version.

Which one?


@itzpingcat commented on GitHub (Feb 8, 2026):

> > Example model I use was derestricted version.
>
> Which one?

FROM hf.co/mradermacher/GLM-4.7-Flash-Derestricted-i1-GGUF:IQ3_S


@rick-github commented on GitHub (Feb 9, 2026):

This is an issue with the llama.cpp implementation. The tokenizer is not supported in the ollama engine, so it falls back to using the llama.cpp runner. The last successful vendor sync for llama.cpp was in https://github.com/ollama/ollama/pull/13451 and synced to approximately b7426 (https://github.com/ggml-org/llama.cpp/releases/tag/b7426). There have been a few bugfixes for glm-4.7-flash since then (https://github.com/ggml-org/llama.cpp/pulls?q=is%3Apr+glm-4.7-flash+created%3A%3E2024-12-16+is%3Aclosed), so it's possible the problem has been resolved. The next vendor sync will be the test.

@itzpingcat commented on GitHub (Feb 9, 2026):

Hmm. How do I force it to run on the ollama (Go) engine? @rick-github


@rick-github commented on GitHub (Feb 9, 2026):

The HF versions of glm-4.7-flash will not run on the ollama engine because the tokenizer is not supported. The HF model also has merged KV_B tensors which are also not supported.


@itzpingcat commented on GitHub (Feb 9, 2026):

Why does the ollama version work, then?

Can I force the Hugging Face one to use the tokenizer that ollama's version uses, and unmerge the KV_B tensors?


@rick-github commented on GitHub (Feb 9, 2026):

The HF model is encoded with a deepseek2 architecture. There is no simple way to get glm-4.7-flash from HF to run on the ollama engine.
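The architecture a GGUF file declares can be checked directly from its header, without loading the model. This is a rough sketch, assuming a GGUF version 2 or later file (little-endian, 64-bit string lengths) and the standard `general.architecture` metadata key. The function names are illustrative and are not part of ollama or llama.cpp.

```python
import struct
from typing import Any, BinaryIO, Optional

# GGUF metadata value type ids mapped to struct formats for fixed-size scalars.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f: BinaryIO, fmt: str) -> Any:
    """Read one little-endian scalar described by a struct format."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f: BinaryIO) -> str:
    """Read a GGUF string: uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8", errors="replace")


def _read_value(f: BinaryIO, vtype: int) -> Any:
    """Read one metadata value of the given type id (arrays may nest)."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_architecture(f: BinaryIO) -> Optional[str]:
    """Return the value of general.architecture, or None if the key is absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 uses a different layout; not handled here")
    _read(f, "<Q")             # tensor count, not needed
    kv_count = _read(f, "<Q")  # number of metadata key/value pairs
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "general.architecture":
            return value
    return None
```

Opening a downloaded file with `open(path, "rb")` and passing the handle to `read_architecture` should, for the Hugging Face file discussed here, return `deepseek2` according to the comment above.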


@itzpingcat commented on GitHub (Feb 9, 2026):

What architecture does the ollama one use?


@rick-github commented on GitHub (Feb 9, 2026):

$ ollama show glm-4.7-flash
  Model
    architecture        glm4moelite    
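For scripting the same check, the `architecture` row can be pulled out of the text that `ollama show` prints. This is a small sketch that assumes the output keeps the two-column layout shown above. The helper name `parse_architecture` is illustrative only.

```python
from typing import Optional


def parse_architecture(show_output: str) -> Optional[str]:
    """Return the value of the 'architecture' row from `ollama show` text output."""
    for line in show_output.splitlines():
        parts = line.split()
        # A matching row looks like: "architecture        glm4moelite"
        if len(parts) >= 2 and parts[0] == "architecture":
            return parts[1]
    return None
```

The text could come from `subprocess.run(["ollama", "show", "glm-4.7-flash"], capture_output=True, text=True).stdout`. A model that reports anything other than an architecture the ollama engine supports would, per the comments above, fall back to the llama.cpp runner.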

@itzpingcat commented on GitHub (Feb 10, 2026):

Is it not possible to convert deepseek2 into glm4moelite?


Reference: github-starred/ollama#55746