[GH-ISSUE #14043] Step 3.5 Flash #71235

Open
opened 2026-05-05 00:49:32 -05:00 by GiteaMirror · 7 comments

Originally created by @ChengYen-Tang on GitHub (Feb 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14043

https://huggingface.co/stepfun-ai/Step-3.5-Flash
https://huggingface.co/stepfun-ai/Step-3.5-Flash-Int4
https://huggingface.co/stepfun-ai/Step-3.5-Flash-FP8

## 3. Performance

Step 3.5 Flash delivers performance parity with leading closed-source systems while remaining open and efficient.

<img width="1257" height="1034" alt="Image" src="https://github.com/user-attachments/assets/9b6b56bd-f4e0-482b-b39d-7520887d8990" />

Performance of Step 3.5 Flash measured across **Reasoning**, **Coding**, and **Agency**. Open-source models (left) are sorted by their total parameter count, while top-tier proprietary models are shown on the right. xbench-DeepSearch scores are sourced from [official publications](https://xbench.org/agi/aisearch) for consistency. The shadowed bars represent the enhanced performance of Step 3.5 Flash using [Parallel Thinking](https://arxiv.org/pdf/2601.05593).

### Detailed Benchmarks

| Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2 Thinking / K2.5 | GLM-4.7 | MiniMax M2.1 | MiMo-V2 Flash |
|---|---|---|---|---|---|---|
| # Activated Params | 11B | 37B | 32B | 32B | 10B | 15B |
| # Total Params (MoE) | 196B | 671B | 1T | 355B | 230B | 309B |
| Est. decoding cost (@ 128K context, Hopper GPU**) | **1.0x** (100 tok/s, MTP-3, EP8) | 6.0x (33 tok/s, MTP-1, EP32) | 18.9x (33 tok/s, no MTP, EP32) | 18.9x (100 tok/s, MTP-3, EP8) | 3.9x (100 tok/s, MTP-3, EP8) | 1.2x (100 tok/s, MTP-3, EP8) |
| **Agency** | | | | | | |
| τ²-Bench | **88.2** | 80.3 | 74.3* / — | 87.4 | 80.2* | 80.3 |
| BrowseComp | 51.6 | 51.4 | 41.5* / **60.6** | 52.0 | 47.4 | 45.4 |
| BrowseComp (w/ Context Manager) | 69.0 | 67.6 | 60.2 / **74.9** | 67.5 | 62.0 | 58.3 |
| BrowseComp-ZH | **66.9** | 65.0 | 62.3 / 62.3* | 66.6 | 47.8* | 51.2* |
| BrowseComp-ZH (w/ Context Manager) | **73.7** | — | — / — | — | — | — |
| GAIA (no file) | **84.5** | 75.1* | 75.6* / 75.9* | 61.9* | 64.3* | 78.2* |
| xbench-DeepSearch (2025.05) | **83.7** | 78.0* | 76.0* / 76.7* | 72.0* | 68.7* | 69.3* |
| xbench-DeepSearch (2025.10) | **56.3** | 55.7* | — / 40+ | 52.3* | 43.0* | 44.0* |
| ResearchRubrics | **65.3** | 55.8* | 56.2* / 59.5* | 62.0* | 60.2* | 54.3* |
| **Reasoning** | | | | | | |
| AIME 2025 | **97.3** | 93.1 | 94.5 / 96.1 | 95.7 | 83.0 | 94.1 (95.1*) |
| HMMT 2025 (Feb.) | **98.4** | 92.5 | 89.4 / 95.4 | 97.1 | 71.0* | 84.4 (95.4*) |
| HMMT 2025 (Nov.) | **94.0** | 90.2 | 89.2* / — | 93.5 | 74.3* | 91.0* |
| IMOAnswerBench | **85.4** | 78.3 | 78.6 / 81.8 | 82.0 | 60.4* | 80.9* |
| **Coding** | | | | | | |
| LiveCodeBench-V6 | **86.4** | 83.3 | 83.1 / 85.0 | 84.9 | — | 80.6 (81.6*) |
| SWE-bench Verified | 74.4 | 73.1 | 71.3 / **76.8** | 73.8 | 74.0 | 73.4 |
| Terminal-Bench 2.0 | **51.0** | 46.4 | 35.7* / 50.8 | 41.0 | 47.9 | 38.5 |

**Notes**:

1. "—" indicates the score is not publicly available or not tested.
2. "*" indicates the original score was inaccessible or lower than our reproduced result, so we report our evaluation under the same test conditions as Step 3.5 Flash to ensure fair comparability.
3. **BrowseComp (with Context Manager)**: When the effective context length exceeds a predefined threshold, the agent resets the context and restarts the agent loop (see the sketch after these notes). By contrast, Kimi K2.5 and DeepSeek-V3.2 used a "discard-all" strategy.
4. **Decoding Cost**: Estimates are based on a methodology similar to, but more accurate than, the approach described in https://arxiv.org/abs/2507.19427.
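Below is a minimal sketch of the reset strategy described in note 3, written as a generic agent loop. It is illustrative only, not StepFun's implementation: the threshold value and the helper names (`call_model`, `count_tokens`, `summarize`) are assumptions, and carrying a summary forward on reset (rather than dropping everything) is an interpretation of the contrast the note draws with the "discard-all" strategy.

```python
CONTEXT_THRESHOLD = 96_000  # assumed reset threshold in tokens; not specified in the note

def run_agent(task, call_model, count_tokens, summarize, max_steps=100):
    """Generic agent loop that resets its context once it grows past a threshold."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        if count_tokens(context) > CONTEXT_THRESHOLD:
            # Context Manager reset: compress progress into a carry-over note and
            # restart the loop with a fresh context. A "discard-all" strategy
            # would instead keep only the original task.
            carry_over = summarize(context)
            context = [
                {"role": "user", "content": task},
                {"role": "assistant", "content": f"Progress so far: {carry_over}"},
            ]
        step = call_model(context)  # one reasoning / tool-use step
        context.append({"role": "assistant", "content": step["content"]})
        if step.get("final"):       # model signals the task is complete
            return step["content"]
        context.append({"role": "user", "content": step["observation"]})
    return None  # step budget exhausted without a final answer
```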
GiteaMirror added the model label 2026-05-05 00:49:32 -05:00

@rick-github commented on GitHub (Feb 3, 2026):

https://github.com/ggml-org/llama.cpp/pull/19271#issuecomment-3835833362


@ChengYen-Tang commented on GitHub (Feb 4, 2026):

https://github.com/ggml-org/llama.cpp/pull/19283


@matbgn commented on GitHub (Feb 6, 2026):

Merged https://github.com/ggml-org/llama.cpp/pull/19283#event-22598677210

Is there a chance that it will also be supported as a ready-to-go `ollama run` / launch (i.e., integrated with opencode, Claude, etc.), both locally and in the cloud?
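For context, "ready-to-go" usage would presumably look like the snippet below. This is a hypothetical sketch: the tag `step-3.5-flash` is a placeholder that is not published in the Ollama library at the time of these comments; only the `ollama` Python client calls themselves are real.

```python
# Hypothetical: what `ollama run`-style usage could look like once the model
# lands in the Ollama library. "step-3.5-flash" is a placeholder tag.
# Requires `pip install ollama` and a local Ollama server with the model pulled.
from ollama import chat

response = chat(
    model="step-3.5-flash",  # placeholder; not available yet
    messages=[{"role": "user", "content": "Hello from Step 3.5 Flash?"}],
)
print(response["message"]["content"])
```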


@ChengYen-Tang commented on GitHub (Feb 7, 2026):

Not fully supported yet
https://github.com/ggml-org/llama.cpp/pull/19283#issuecomment-3864383225


@matbgn commented on GitHub (Feb 7, 2026):

I guess you saw the answer to your issue.


@matbgn commented on GitHub (Feb 25, 2026):

Seems that it was fixed with: https://github.com/ggml-org/llama.cpp/pull/19635


@bendtherules commented on GitHub (Mar 13, 2026):

Any update?

Reference: github-starred/ollama#71235