[GH-ISSUE #15372] qwen3next: layer 2 missing attn_qkv/attn_gate projections #71895

Open
opened 2026-05-05 02:53:12 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @CJames1261 on GitHub (Apr 6, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15372

What is the issue?

similar to this issue but for layer 2 instead of layer 0: https://github.com/ollama/ollama/pull/15133

the error states: 500 Internal Server Error: failed to initialize model: qwen3next: layer 2 missing attn_qkv/attn_gate projections

I'm trying to use the model qwen3-coder-next

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

Originally created by @CJames1261 on GitHub (Apr 6, 2026). Original GitHub issue: https://github.com/ollama/ollama/issues/15372 ### What is the issue? similar to this issue but for layer 2 instead of layer 0: https://github.com/ollama/ollama/pull/15133 the error states: 500 Internal Server Error: failed to initialize model: qwen3next: layer 2 missing attn_qkv/attn_gate projections I'm trying to use the model qwen3-coder-next ### Relevant log output ```shell ``` ### OS _No response_ ### GPU _No response_ ### CPU _No response_ ### Ollama version _No response_
GiteaMirror added the bug label 2026-05-05 02:53:12 -05:00
Author
Owner

@rick-github commented on GitHub (Apr 6, 2026):

What version of ollama?

ollama -v
<!-- gh-comment-id:4195083210 --> @rick-github commented on GitHub (Apr 6, 2026): What version of ollama? ``` ollama -v ```
Author
Owner

@CJames1261 commented on GitHub (Apr 6, 2026):

Ollama version: 0.20.2

Root cause (found through investigation):
The GGUF for this model is distributed as 4 split shard files (-00001-of-00004.gguf through -00004-of-00004.gguf). Ollama only reads tensor metadata from the first shard file, so tensors belonging to layer 2 (which are stored in a later shard) are never loaded. When Validate() runs, gdn.SSMIn, gdn.SSMQKV, and gdn.SSMQKVGate are all nil for that layer, triggering the error.

Workaround:
Merging the 4 shards into a single file using llama-gguf-split --merge resolves the error and the model loads successfully. However, inference is very slow — the merged Q5_K_M file is ~57GB which exceeds available VRAM, causing most of the model to run on CPU/RAM rather than GPU. A smaller quantization (Q4_K_M) may help with speed but the same workaround would still be required.

Suggested fix:
Ollama should support multi-file split GGUFs natively — similar to how it already handles pytorch_model--of-.bin for PyTorch models. When a path like model-00001-of-00004.gguf is provided, Ollama should automatically discover and load all sibling shards before running validation.

note: I'm new to this and this explanation was what my llm suggested was the best way to explain the problem.

<!-- gh-comment-id:4195112560 --> @CJames1261 commented on GitHub (Apr 6, 2026): Ollama version: 0.20.2 Root cause (found through investigation): The GGUF for this model is distributed as 4 split shard files (-00001-of-00004.gguf through -00004-of-00004.gguf). Ollama only reads tensor metadata from the first shard file, so tensors belonging to layer 2 (which are stored in a later shard) are never loaded. When Validate() runs, gdn.SSMIn, gdn.SSMQKV, and gdn.SSMQKVGate are all nil for that layer, triggering the error. Workaround: Merging the 4 shards into a single file using llama-gguf-split --merge resolves the error and the model loads successfully. However, inference is very slow — the merged Q5_K_M file is ~57GB which exceeds available VRAM, causing most of the model to run on CPU/RAM rather than GPU. A smaller quantization (Q4_K_M) may help with speed but the same workaround would still be required. Suggested fix: Ollama should support multi-file split GGUFs natively — similar to how it already handles pytorch_model-*-of-*.bin for PyTorch models. When a path like model-00001-of-00004.gguf is provided, Ollama should automatically discover and load all sibling shards before running validation. note: I'm new to this and this explanation was what my llm suggested was the best way to explain the problem.
Author
Owner

@rick-github commented on GitHub (Apr 6, 2026):

note: I'm new to this and this explanation was what my llm suggested was the best way to explain the problem.

The LLM is correct that the files need to be merged. It's incorrect in thinking that ollama handles split pytorch bin models.

qwen3next is a large model and requires significant resources to run. What sort of GPU do you have, and what is the intended use?

#5245 for the ticket about handling split files without merging.

<!-- gh-comment-id:4195141088 --> @rick-github commented on GitHub (Apr 6, 2026): > note: I'm new to this and this explanation was what my llm suggested was the best way to explain the problem. The LLM is correct that the files need to be merged. It's incorrect in thinking that ollama handles split pytorch bin models. qwen3next is a large model and requires significant resources to run. What sort of GPU do you have, and what is the intended use? #5245 for the ticket about handling split files without merging.
Author
Owner

@CJames1261 commented on GitHub (Apr 6, 2026):

Name AdapterRAM


AMD Radeon(TM) 890M Graphics 536870912
NVIDIA GeForce RTX 5090 Laptop GPU 4293918720

I am wanting to have a local model that can help me with developing type script, java, and next.js

I also want a model that can help me with local python projects.

I've tried qwen2.5:7b but when I say things like 'modify this file and take out section X' it doesn't actually do it, claude reccomended I try another model so I'm trying this one.

<!-- gh-comment-id:4195163982 --> @CJames1261 commented on GitHub (Apr 6, 2026): Name AdapterRAM ---- ---------- AMD Radeon(TM) 890M Graphics 536870912 NVIDIA GeForce RTX 5090 Laptop GPU 4293918720 I am wanting to have a local model that can help me with developing type script, java, and next.js I also want a model that can help me with local python projects. I've tried qwen2.5:7b but when I say things like 'modify this file and take out section X' it doesn't actually do it, claude reccomended I try another model so I'm trying this one.
Author
Owner

@rick-github commented on GitHub (Apr 6, 2026):

RTX 5090 is quite respectable. I suggest something in qwen3.5 series, say qwen3.5:27b. Weights are 17GB and 64k context would take another 10GB so will fit comfortably on the 5090.

<!-- gh-comment-id:4195202210 --> @rick-github commented on GitHub (Apr 6, 2026): RTX 5090 is quite respectable. I suggest something in [qwen3.5](https://ollama.com/library/qwen3.5) series, say [qwen3.5:27b](https://ollama.com/library/qwen3.5:27b). Weights are 17GB and 64k context would take another 10GB so will fit comfortably on the 5090.
Author
Owner

@CJames1261 commented on GitHub (Apr 6, 2026):

I've been connecting a local LLM to the claw-code CLI (a Claude Code clone written in Rust: https://github.com/ultraworkers/claw-code) using Ollama as the backend. I run it like this:

.\rust\target\release\claw.exe --model qwen-coder:latest

With these environment variables set:

OPENAI_API_KEY=ollama
OPENAI_BASE_URL=http://localhost:11434/v1

Claw connects to Ollama's OpenAI-compatible API, so the model receives tool definitions and is expected to call them to interact with the file system.

My question is: for this use case (local LLM + claw-code CLI + file system tools), do you recommend sticking with Ollama directly, or would building a custom FastAPI middleware server between claw and Ollama give meaningful benefits — for example better tool call parsing, fallback handling, or prompt injection?

<!-- gh-comment-id:4195215281 --> @CJames1261 commented on GitHub (Apr 6, 2026): I've been connecting a local LLM to the claw-code CLI (a Claude Code clone written in Rust: https://github.com/ultraworkers/claw-code) using Ollama as the backend. I run it like this: ``` .\rust\target\release\claw.exe --model qwen-coder:latest ``` With these environment variables set: ``` OPENAI_API_KEY=ollama OPENAI_BASE_URL=http://localhost:11434/v1 ``` Claw connects to Ollama's OpenAI-compatible API, so the model receives tool definitions and is expected to call them to interact with the file system. My question is: for this use case (local LLM + claw-code CLI + file system tools), do you recommend sticking with Ollama directly, or would building a custom FastAPI middleware server between claw and Ollama give meaningful benefits — for example better tool call parsing, fallback handling, or prompt injection?
Author
Owner

@rick-github commented on GitHub (Apr 6, 2026):

The only benefit I see from middleware in this case would be redundancy for maintaining service uptime. LiteLLM is a popular choice, but there are others.

<!-- gh-comment-id:4195275175 --> @rick-github commented on GitHub (Apr 6, 2026): The only benefit I see from middleware in this case would be redundancy for maintaining service uptime. [LiteLLM](https://github.com/BerriAI/litellm) is a popular choice, but there are others.
Author
Owner

@PureBlissAK commented on GitHub (Apr 18, 2026):

🤖 Automated Triage & Analysis Report

Issue: #15372
Analyzed: 2026-04-18T18:22:27.432646

Analysis

  • Type: unknown
  • Severity: medium
  • Components: unknown

Implementation Plan

  • Effort: medium
  • Steps:

This issue has been triaged and marked for implementation.

<!-- gh-comment-id:4274310035 --> @PureBlissAK commented on GitHub (Apr 18, 2026): <!-- ollama-issue-orchestrator:v1 issue:15372 --> ## 🤖 Automated Triage & Analysis Report **Issue**: #15372 **Analyzed**: 2026-04-18T18:22:27.432646 ### Analysis - **Type**: unknown - **Severity**: medium - **Components**: unknown ### Implementation Plan - **Effort**: medium - **Steps**: *This issue has been triaged and marked for implementation.*
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/ollama#71895