[GH-ISSUE #12505] GLM 4.6 is unsupported #8302

Closed
opened 2026-04-12 20:51:11 -05:00 by GiteaMirror · 10 comments

Originally created by @kbradsha on GitHub (Oct 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12505

What is the issue?

GLM 4.6 is unsupported by Ollama. Searching for the error turns up a Hugging Face discussion indicating that a newer llama.cpp runtime is required.

https://huggingface.co/unsloth/GLM-4.6-GGUF/discussions/7

Relevant log output

ollama run GLM-4.6-Q4
Error: 500 Internal Server Error: llama runner process has terminated: error loading model: missing tensor 'blk.92.nextn.embed_tokens.weight'
llama_model_load_from_file_impl: failed to load model

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 20:51:11 -05:00

@rick-github commented on GitHub (Oct 11, 2025):

#12552 will pull in the required support.


@kbradsha commented on GitHub (Oct 12, 2025):

Well, that fixed the support... sort of. The model runs, but it doesn't load any layers onto the GPUs.
Oddly enough, the larger DeepSeek model does load layers onto the GPUs.

# Install the CUDA toolchain needed for the build
sudo apt-get install nvidia-cuda-toolkit
git clone https://github.com/ollama/ollama.git
cd ollama
# Fetch and switch to the PR branch with GLM 4.6 support
git fetch origin pull/12552/head:pr-12552
git switch pr-12552
# Re-sync the vendored llama.cpp sources and apply Ollama's patches
make -f Makefile.sync clean
make -f Makefile.sync apply-patches
make -f Makefile.sync sync
cmake -B build --fresh && cmake --build build -j 24
cp ollama /usr/local/bin

sudo systemctl restart ollama

ollama list
NAME ID SIZE MODIFIED
GLM-4.6-Q4:latest 61f07ff15c24 202 GB 3 hours ago
DeepSeek-V3.1-Terminus:latest 0a70575d1fb5 251 GB 2 days ago
GLM-4.5-IQ4:latest 23c63da9b8c0 191 GB 3 days ago
GLM4.5-Air-Q8:latest ed0933c20fa7 127 GB 3 days ago
Qwen3-Coder-30B:latest 39384af73896 35 GB 3 days ago
Qwen3-235B-A22B:latest 24182b768ab3 134 GB 6 days ago
Qwen3-Coder-480B:latest efca1114f769 180 GB 7 days ago
chatGPT-120B:latest c5efbaa5fa0a 63 GB 3 weeks ago

ollama run GLM-4.6-Q4 --verbose

>>> Hello who are you and who created you?
I am an AI language model developed by Zhipu AI, designed to assist with a wide range of topics and questions. My purpose is to provide helpful, accurate, and thoughtful
responses to support your needs.

How can I assist you today?

total duration: 20.149722505s
load duration: 150.728913ms
prompt eval count: 20 token(s)
prompt eval duration: 1.584086174s
prompt eval rate: 12.63 tokens/s
eval count: 51 token(s)
eval duration: 18.414156254s
eval rate: 2.77 tokens/s

>>> /bye

ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
GLM-4.6-Q4:latest 61f07ff15c24 251 GB 100% CPU 131072 4 minutes from now

sudo systemctl restart ollama

ollama run DeepSeek-V3.1-Terminus --verbose

>>> Hello who are you and who created you?
Hello! I'm DeepSeek, an AI assistant created by DeepSeek Company. I'm here to help you with questions, conversations, and various tasks. I'm a text-based model that can
assist with everything from writing and analysis to problem-solving and coding.

I'm currently the latest version of DeepSeek, and I'm completely free to use. I support file uploads (like images, PDFs, Word documents, etc.), have a 128K context window,
and can also search the web if you enable that feature in the interface.

Is there anything specific I can help you with today? 😊

total duration: 1m22.155020797s
load duration: 91.352879ms
prompt eval count: 17 token(s)
prompt eval duration: 33.24368s
prompt eval rate: 0.51 tokens/s
eval count: 128 token(s)
eval duration: 48.818561417s
eval rate: 2.62 tokens/s

ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
DeepSeek-V3.1-Terminus:latest 0a70575d1fb5 276 GB 68%/32% CPU/GPU 131072 4 minutes from now


@rick-github commented on GitHub (Oct 12, 2025):

The server log (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will contain details of the layer assignment.


@kbradsha commented on GitHub (Oct 12, 2025):

server.log (https://github.com/user-attachments/files/22875132/server.log), attached from a fresh run.


@rick-github commented on GitHub (Oct 12, 2025):

The log is missing data because the command was not run with --no-pager.

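For anyone following along, the full log can be captured like this (a minimal sketch, assuming Ollama runs as the systemd unit `ollama` on Linux):

```shell
# Dump the complete server log to a file; --no-pager keeps journalctl
# from handing the output to a pager that can cut it off.
journalctl -u ollama --no-pager > server.log
```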

@kbradsha commented on GitHub (Oct 13, 2025):

server.log (https://github.com/user-attachments/files/22875447/server.log), captured with journalctl --no-pager.


@rick-github commented on GitHub (Oct 13, 2025):

Oct 12 18:40:00 kbradsha3975wx ollama[2303926]: time=2025-10-12T18:40:00.239-05:00 level=INFO source=server.go:545
 msg=offload library=CUDA layers.requested=-1 layers.model=94 layers.offload=0 layers.split=[]
 memory.available="[23.3 GiB 23.3 GiB 23.3 GiB 23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="234.4 GiB"
 memory.required.partial="0 B" memory.required.kv="46.5 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B]"
 memory.weights.total="187.9 GiB" memory.weights.repeating="187.3 GiB" memory.weights.nonrepeating="607.1 MiB"
 memory.graph.full="93.0 GiB" memory.graph.partial="93.0 GiB"

A context size of 128k requires a 93 GiB memory graph. That's larger than will fit on a single GPU, so the model is loaded into system RAM. Setting `OLLAMA_KV_CACHE_TYPE` will reduce this (at the cost of precision), although I don't have the model handy to test by how much.
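
For anyone landing here later, one way to apply this on a systemd install is an override file (a minimal sketch; `q8_0` is an assumed example value, and quantized KV cache in Ollama also needs flash attention enabled):

```shell
# Open an override editor for the ollama unit and add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
sudo systemctl edit ollama
# Restart so the new environment takes effect
sudo systemctl restart ollama
```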


@kbradsha commented on GitHub (Oct 13, 2025):

Oh, excellent work, and thank you. I'll just lower num_ctx. It was confusing because DeepSeek is a larger model with the same num_ctx. I was using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 and just assumed I had plenty of RAM (512 GB) to spill over into. I guess it doesn't work the way I expected, does it?
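
For reference, num_ctx can be lowered per session or baked into a model variant (a minimal sketch; `32768` is an arbitrary example value and `GLM-4.6-Q4-32k` a made-up tag):

```shell
# Per session, inside the REPL:
#   ollama run GLM-4.6-Q4
#   >>> /set parameter num_ctx 32768
# Or as a permanent variant via a Modelfile:
cat > Modelfile <<'EOF'
FROM GLM-4.6-Q4
PARAMETER num_ctx 32768
EOF
ollama create GLM-4.6-Q4-32k -f Modelfile
```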


@rick-github commented on GitHub (Oct 13, 2025):

Different models use different amounts of memory per token. I haven't looked at GLM-4.6 internals yet, but from the logs it seems quite expensive in terms of token memory. GGML_CUDA_ENABLE_UNIFIED_MEMORY is only useful if you force layers onto the GPU by setting num_gpu. Unified memory can result in reduced performance, worse than just using system memory; see https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900.
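
For reference, forcing layers onto the GPU looks roughly like this (a minimal sketch; `999` is simply an oversized value meaning "offload every layer", and as noted above unified memory can end up slower than plain system RAM):

```shell
# Assumes GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is already set in the
# server's environment (e.g. via `sudo systemctl edit ollama`).
ollama run GLM-4.6-Q4
>>> /set parameter num_gpu 999
```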


@kbradsha commented on GitHub (Oct 13, 2025):

Thank you very much for the clarification. I am closing this ticket, extremely satisfied with the outcome.
