[GH-ISSUE #15441] Please add qwen3.5:122b-a10b-q8_0 quantization to model registry #56380

Closed
opened 2026-04-29 10:44:28 -05:00 by GiteaMirror · 4 comments

Originally created by @branmacstudio on GitHub (Apr 9, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15441

Request

Please add a q8_0 quantization tag for qwen3.5:122b-a10b. The current registry only has q4_K_M (81 GB) under that model.

Context

I run a small document-processing business doing structured data extraction from business documents into spreadsheets. The pipeline emits deterministic JSON via structured/constrained generation, so small accuracy regressions show up immediately as wrong numbers in the downstream output. In my testing, the gap between Q4 and Q8 on this workload is measurable and material: Q4 produces enough errors per document to be unusable in production. Q6 is the floor I can tolerate.
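For reference, a single extraction call in the pipeline looks roughly like the sketch below: Ollama's /api/chat with a JSON schema in the "format" field. The model tag, prompt, and schema here are illustrative placeholders, not my production setup.

```bash
# Minimal sketch of one constrained-extraction request against Ollama.
# "format" takes a JSON schema and constrains generation to match it.
# Model tag, prompt, and schema below are placeholders.
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:122b-a10b",
  "messages": [
    {"role": "user", "content": "Extract the invoice number and total from the following document text: ..."}
  ],
  "format": {
    "type": "object",
    "properties": {
      "invoice_number": {"type": "string"},
      "total": {"type": "number"}
    },
    "required": ["invoice_number", "total"]
  },
  "stream": false
}'
```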

Current setup:

  • M2 Ultra 128 GB → running Qwen3.5-122B-A10B at Q6_K via llama.cpp today (Q8 won't fit alongside the rest of my services)
  • M3 Ultra 256 GB on order, specifically to run multiple concurrent Q8 workers for higher accuracy and parallelism
  • I'd like to evaluate Ollama's MLX backend on the new machine once it arrives

Without a Q8 tag in the registry I can't run a fair head-to-head against my current llama.cpp Q8 baseline. The workarounds (manual GGUF build + Modelfile import) lose the registry auto-update story, which is most of the value of switching to Ollama in the first place.
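For concreteness, the manual route looks roughly like the sketch below (the GGUF filename is a placeholder for whatever the llama.cpp convert/quantize step produces):

```bash
# Sketch of the manual GGUF + Modelfile workaround.
# The .gguf filename is a placeholder; produce it with llama.cpp's
# convert/quantize tooling first.
cat > Modelfile <<'EOF'
FROM ./Qwen3.5-122B-A10B-Q8_0.gguf
EOF
ollama create qwen3.5-122b-a10b-q8_0 -f Modelfile
ollama run qwen3.5-122b-a10b-q8_0
```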

Why this might be a good signal for Ollama

122B-A10B at Q8 is one of the more demanding sustained-throughput workloads you can target on Apple Silicon at this hardware tier. It stresses MLX's prompt caching, KV cache sizing for hybrid attention, and the recent Qwen3.5 thinking-token fixes from v0.19 all at once. If it runs cleanly at this size on M3 Ultra, it validates the MLX backend for the whole "single-machine production inference on Mac Studio" use case.

Evaluation plan once the hardware lands

A 3–5-document benchmark comparing wall-clock time, output-JSON byte-equality against my current llama.cpp Q8 baseline, thinking-token handling under structured output, and multi-worker behavior (4 concurrent extractions). Happy to share results here if useful.
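Sketch of the harness (run_llamacpp.sh and run_ollama.sh are hypothetical wrappers that send the same document and schema to each backend and print the raw JSON):

```bash
# Sketch: byte-equality comparison between the two backends.
# run_llamacpp.sh / run_ollama.sh are hypothetical per-backend wrappers.
mkdir -p out/llamacpp out/ollama
for doc in docs/*.txt; do
  name=$(basename "$doc" .txt)
  ./run_llamacpp.sh "$doc" > "out/llamacpp/$name.json"
  ./run_ollama.sh   "$doc" > "out/ollama/$name.json"
  # cmp -s exits 0 only if the two files are identical byte-for-byte
  if cmp -s "out/llamacpp/$name.json" "out/ollama/$name.json"; then
    echo "MATCH $name"
  else
    echo "DIFF  $name"
  fi
done
```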

Thanks.

GiteaMirror added the model label 2026-04-29 10:44:28 -05:00

@rick-github commented on GitHub (Apr 9, 2026):

> M3 Ultra 256 GB on order, specifically to run multiple concurrent Q8 workers for higher accuracy and parallelism

Due to model architecture, qwen3.5 does not currently support parallel queries. A workaround is to run [multiple servers](https://github.com/ollama/ollama/issues/4165#issuecomment-4176068568).
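Roughly (each instance binds its own port via OLLAMA_HOST and loads its own copy of the model, so budget memory accordingly):

```bash
# Sketch: four independent Ollama servers, one per port.
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
OLLAMA_HOST=127.0.0.1:11436 ollama serve &
OLLAMA_HOST=127.0.0.1:11437 ollama serve &

# Clients pick an instance by pointing OLLAMA_HOST at its port:
OLLAMA_HOST=127.0.0.1:11435 ollama run qwen3.5:122b-a10b "extract ..."
```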


@branmacstudio commented on GitHub (Apr 9, 2026):

Thanks for the heads-up on the parallelism limitation — that's consistent with how I run llama.cpp today (single-slot per instance, multiple instances behind a queue), so my M3 plan is already 4 independent inference servers each pinned to one model file, just like the workaround you linked. Switching from llama-server to ollama serve is a binary swap, not an architectural change, on that side.
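Concretely, the dispatch side is just spreading documents across the four ports, something like this (sketch; run_extract.sh is a hypothetical wrapper that sends one document to the instance on a given port):

```bash
# Sketch: round-robin documents across four independent Ollama instances,
# keeping at most four extractions in flight.
ports=(11434 11435 11436 11437)
i=0
for doc in docs/*.txt; do
  port=${ports[$((i % 4))]}
  ./run_extract.sh "$port" "$doc" &
  i=$((i + 1))
  (( i % 4 == 0 )) && wait
done
wait
```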

The blocker for evaluating Ollama on this workload is still the missing q8_0 tag for qwen3.5:122b-a10b — q4_K_M is below the accuracy floor I can tolerate for digit-accurate structured extraction. Is adding a q8_0 quant something the team can prioritize, or is there a reason the 122B-A10B is intentionally quant-limited to q4 in the registry?

Thanks again.


@rick-github commented on GitHub (Apr 9, 2026):

I don't know why q8_0/bf16 weren't uploaded to the library; maybe there were so many releases that they fell through the cracks. I know you want an official version, but [frob/qwen3.5:122b-a10b-q8_0](https://ollama.com/frob/qwen3.5:122b-a10b-q8_0) will be bitwise identical (modulo changes to the quant code in the meantime) to the library model if/when it's published.
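That is, it can be pulled directly with the usual namespaced-model syntax:

```bash
ollama pull frob/qwen3.5:122b-a10b-q8_0
```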


@branmacstudio commented on GitHub (Apr 9, 2026):

Okay, thanks. I'll try testing against that.

Reference: github-starred/ollama#56380