[GH-ISSUE #9530] [Website] [Bug] Incorrect Sampling Parameters for QwQ 32B #68273

Open
opened 2026-05-04 13:04:29 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @vYLQs6 on GitHub (Mar 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9530

The Ollama repo for QwQ 32B doesn't set any sampling parameters, which can cause significant performance degradation:

![Image](https://github.com/user-attachments/assets/6bd6f966-94d8-447a-aab0-f64ba8e3235d)

From the QwQ HF page:

> Usage Guidelines
> To achieve optimal performance, we recommend the following settings:
> ....
> Sampling Parameters:
> Use Temperature=0.6 and TopP=0.95 instead of Greedy decoding to avoid endless repetitions.
> Use TopK between 20 and 40 to filter out rare token occurrences while maintaining the diversity of the generated output.

Official generation config from the QwQ HF page:

```
{
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.0,
  "temperature": 0.6,
  "top_k": 40,
  "top_p": 0.95,
  "transformers_version": "4.45.2"
}
```

I would recommend fixing this, since it's a really easy fix.
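For illustration, a minimal sketch of what the fix could look like in the model's Modelfile, assuming the base tag is `qwq:32b`; the parameter values are taken directly from the generation config quoted above:

```
# Hypothetical Modelfile for a corrected QwQ 32B; values come from the
# official generation_config.json quoted above.
FROM qwq:32b
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 40
PARAMETER repeat_penalty 1.0
```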


The system prompt also needs an update. Here is the new official system prompt from the QwQ-32B demo by the Qwen team:

You are a helpful and harmless assistant.

```
def format_history(history):
    messages = [{
        "role": "system",
        "content": "You are a helpful and harmless assistant.",
    }]
    for item in history:
        if item["role"] == "user":
            messages.append({"role": "user", "content": item["content"]})
        elif item["role"] == "assistant":
            messages.append({"role": "assistant", "content": item["content"]})
    return messages
```

https://huggingface.co/spaces/Qwen/QwQ-32B-Demo/blob/main/app.py
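Until the library model is updated, both the system prompt and the sampling parameters can also be supplied per request. Below is a minimal sketch against a local Ollama server's /api/chat endpoint; the model tag `qwq:32b` is an assumption:

```
# Sketch only: passes the recommended sampling parameters and system
# prompt with the request instead of relying on Modelfile defaults.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwq:32b",  # assumed library tag
        "messages": [
            {"role": "system", "content": "You are a helpful and harmless assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        # Per-request overrides for the parameters missing from the repo
        "options": {"temperature": 0.6, "top_p": 0.95, "top_k": 40},
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```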

Author
Owner

@vYLQs6 commented on GitHub (Mar 6, 2025):

Source:
https://huggingface.co/Qwen/QwQ-32B/blob/main/generation_config.json
https://huggingface.co/Qwen/QwQ-32B/blob/main/README.md
Author
Owner

@yorktownting commented on GitHub (Mar 6, 2025):

My QwQ-32B-Q5_K_M will ask and answer questions by itself, producing endless output. Is this what's causing that?

Author
Owner

@WR-CREATOR commented on GitHub (Mar 11, 2025):

I tested QwQ-fp16 deployed with both Ollama and other methods on the same difficult task, and found that the outputs from the Ollama deployment were far inferior to those from the other deployment methods. Is this the reason?

Author
Owner

@yorktownting commented on GitHub (Mar 11, 2025):

> I tested QwQ-fp16 deployed with both Ollama and other methods on the same difficult task, and found that the outputs from the Ollama deployment were far inferior to those from the other deployment methods. Is this the reason?

Did you try this? They have bug-fixed parameters in their repo:
https://docs.unsloth.ai/basics/tutorial-how-to-run-qwq-32b-effectively
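For anyone hitting this before the upstream repo is fixed, one possible local workaround is to bake the corrected parameters into a derivative model, using a Modelfile like the sketch in the issue body above; `qwq-fixed` is a hypothetical tag:

```
ollama create qwq-fixed -f Modelfile   # Modelfile as sketched above
ollama run qwq-fixed
```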

Author
Owner

@WR-CREATOR commented on GitHub (Mar 11, 2025):

> > I tested QwQ-fp16 deployed with both Ollama and other methods on the same difficult task, and found that the outputs from the Ollama deployment were far inferior to those from the other deployment methods. Is this the reason?
>
> Did you try this? They have bug-fixed parameters in their repo: https://docs.unsloth.ai/basics/tutorial-how-to-run-qwq-32b-effectively

No, in my deployment the generated results were simply relatively poor; there were no endless generations.

Reference: github-starred/ollama#68273