[GH-ISSUE #12022] Qwen3:4B Performance Issue: think: false Parameter Not Working + Slower Than 8B #54495

Closed
opened 2026-04-29 06:09:02 -05:00 by GiteaMirror · 4 comments

Originally created by @NeaByteLab on GitHub (Aug 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12022

What is the issue?

Issue Description

The "think": false parameter is not working properly with Qwen3:4B, causing unwanted thinking output and significantly slower performance compared to Qwen3:8B. This bug appeared after the Qwen3:4B model was updated approximately 2 weeks ago.

Environment

  • OS: macOS (Darwin 24.6.0)
  • Ollama Version: 0.11.6 (client: 0.9.6)
  • Models Tested: qwen3:4b, qwen3:8b

Reproduction Steps

  1. Install both models: ollama pull qwen3:4b and ollama pull qwen3:8b
  2. Test with API call using "think": false:
# Qwen3:4B - Still shows thinking despite parameter
curl -X POST http://localhost:11434/api/generate \
  -d '{"model": "qwen3:4b", "prompt": "What is 2+2? Answer in one word only.", "stream": false, "think": false}'

# Qwen3:8B - Works correctly, no thinking output
curl -X POST http://localhost:11434/api/generate \
  -d '{"model": "qwen3:8b", "prompt": "What is 2+2? Answer in one word only.", "stream": false, "think": false}'

Expected vs Actual Behavior

Expected: Both models should respect "think": false and output clean responses without the thinking process.

Actual:

  • Qwen3:4B: Ignores "think": false, outputs verbose thinking + response (~21.6s)
  • Qwen3:8B: Correctly respects parameter, clean output (~7.5s)

Performance Comparison

| Model | Total Duration | Load Duration | Prompt Eval | Eval Duration | Response Quality |
|----------|--------|-------|-------|--------|-------------------|
| qwen3:4b | ~21.6s | ~4.8s | ~0.5s | ~16.3s | Includes thinking |
| qwen3:8b | ~7.5s  | ~3.4s | ~2.8s | ~1.3s  | Clean output      |

Impact

  1. Parameter Inconsistency: Same API parameter behaves differently across model variants
  2. Performance Regression: 4B model is ~3x slower than 8B model
  3. Unwanted Output: Users cannot disable the thinking process for the 4B model
  4. API Reliability: Inconsistent behavior breaks expected API contract

Additional Notes

  • The 4B model consistently outputs the thinking process regardless of the "think": false parameter
  • The performance difference is significant and unexpected (a smaller model should be faster)
  • This affects production use cases where clean, fast responses are needed
  • Timing: This bug appeared after the Qwen3:4B model was updated approximately 2 weeks ago (model ID: e55aed6fe643)
  • Regression: Previous versions of Qwen3:4B likely worked correctly before this update
  • Context Size Mismatch: Server logs show warnings about context size conflicts (see Detailed Diagnostic Information below)
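As a stopgap for the behavior described above, the reasoning block can be stripped on the client side. A minimal Python sketch, assuming the model wraps its reasoning in `<think>...</think>` tags (as Qwen3 does when thinking is enabled); the tag name is the only assumption here:

```python
import re

# Matches one <think>...</think> block plus any trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(response_text: str) -> str:
    """Remove <think>...</think> reasoning blocks from a model response."""
    return THINK_BLOCK.sub("", response_text).strip()

raw = "<think>The user asks 2+2; that is 4, so answer 'Four'.</think>Four."
print(strip_thinking(raw))  # -> Four.
```

This only hides the unwanted output; the thinking tokens are still generated, so it does not recover the lost speed.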

Detailed Diagnostic Information

Model Configuration Comparison

| Model | Context Length | Embedding Length | Quantization | Parameters |
|----------|----------------|-------|--------|------|
| qwen3:4b | 262,144 (256K) | 2,560 | Q4_K_M | 4.0B |
| qwen3:8b | 40,960 (40K)   | 4,096 | Q4_K_M | 8.2B |

Server Log Analysis

# Context size warnings in logs:
"requested context size too large for model" num_ctx=131072 n_ctx_train=40960

# Model loading shows:
llama_context: n_ctx_per_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized

# Runtime context allocation:
llama_context: n_ctx = 131072
llama_context: n_ctx_per_seq = 131072

Key Findings

  1. Context size configuration mismatch between model metadata (256K) and runtime limits (40K)
  2. API requests use 128K context which conflicts with both limits
  3. The "think": false parameter failure correlates with this context mismatch
  4. Performance degradation likely caused by context size negotiation overhead
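If the 128K request is coming from the client side, one way to avoid the warning is to pin `num_ctx` explicitly in the request's `options` field (40960 matches the `n_ctx_train` value in the logs above). A sketch in Python that only builds the request body; the actual HTTP POST to `/api/generate` is omitted:

```python
import json

def build_generate_payload(model: str, prompt: str, num_ctx: int) -> str:
    """Build an Ollama /api/generate request body with a pinned context size.

    Note that "think" is a top-level request field, while "num_ctx"
    belongs inside "options"; mixing the two up is an easy mistake.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "think": False,                   # top-level, not inside options
        "options": {"num_ctx": num_ctx},  # cap context at the trained size
    }
    return json.dumps(payload)

body = build_generate_payload(
    "qwen3:4b", "What is 2+2? Answer in one word only.", 40960)
print(body)
```

Pinning `num_ctx` rules out context-size negotiation as a variable when comparing the two models, even if it turns out to be unrelated to the thinking output.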

Request

Please investigate why:

  1. The "think": false parameter is ignored by Qwen3:4B
  2. Qwen3:4B performs significantly slower than Qwen3:8B
  3. There's inconsistency in parameter handling across model variants
  4. There's a context size configuration mismatch between model metadata and runtime limits

This appears to be a regression in the Qwen3:4B model configuration that was introduced in the recent update (model ID: e55aed6fe643, ~2 weeks ago).

Relevant log output


OS

macOS

GPU

Apple

CPU

Apple

Ollama version

No response

GiteaMirror added the bug label 2026-04-29 06:09:02 -05:00

@rick-github commented on GitHub (Aug 22, 2025):

Use the non-reasoning version of the model: https://ollama.com/library/qwen3:4b-instruct


@NeaByteLab commented on GitHub (Aug 22, 2025):

@rick-github - Thank you for the excellent suggestion!
I've now tested qwen3:4b-instruct and can confirm it's the perfect solution:

  • qwen3:4b-instruct works perfectly with "think": false - clean output in ~4.9s
  • qwen3:4b still ignores "think": false - outputs thinking process in ~20.3s
  • qwen3:8b works correctly but slower at ~28.3s

Performance ranking: 4b-instruct > 8b > 4b

The qwen3:4b-instruct suggestion is spot-on and actually provides the best performance. However, the underlying bug in qwen3:4b should still be investigated since it represents a regression in behavior and API consistency.


@rick-github commented on GitHub (Aug 22, 2025):

It's not a bug. Qwen changed how the 4b model works. It is no longer a hybrid model; it now comes in two versions, thinking and non-thinking (instruct). qwen3:4b is an alias for the thinking version of the model.


@NeaByteLab commented on GitHub (Aug 22, 2025):

Thank you @rick-github for the clarification!

I understand now - this isn't a bug but an intentional architectural change by Qwen.
The split into thinking vs non-thinking versions makes perfect sense.

I appreciate you taking the time to explain the model changes.
Thanks again for the help!
