[GH-ISSUE #11310] UI Settings for Context Length, Flash Attention, Parallel, etc #69520

Open
opened 2026-05-04 18:18:26 -05:00 by GiteaMirror · 6 comments

Originally created by @chigkim on GitHub (Jul 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11310

Since Ollama now has a UI for settings, could we have settings for OLLAMA_CONTEXT_LENGTH, OLLAMA_FLASH_ATTENTION, OLLAMA_NUM_PARALLEL, etc. in the settings dialog? That would be amazing!
Thanks!

GiteaMirror added the feature request label 2026-05-04 18:18:26 -05:00

@subrotoxing commented on GitHub (Jul 6, 2025):

How do I enable flash attention on Windows?


@chigkim commented on GitHub (Jul 6, 2025):

Set the environment variable OLLAMA_FLASH_ATTENTION to 1 before launching Ollama.
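On Windows, the usual persistent route is setting a user environment variable (for example via `setx OLLAMA_FLASH_ATTENTION 1` in a terminal, or the System Properties dialog) and then restarting Ollama. For a quick, non-persistent test you can also launch the server yourself with the variable set only for that process; a minimal Python sketch, assuming `ollama` is on your PATH and no other Ollama server is already running:

```python
import os
import subprocess

# Copy the current environment and enable flash attention for this
# server process only (no permanent system-wide change needed).
env = os.environ.copy()
env["OLLAMA_FLASH_ATTENTION"] = "1"

# Launch the Ollama server with the variable set. This blocks while the
# server runs; it is equivalent to setting the variable in the shell
# before running `ollama serve`.
subprocess.run(["ollama", "serve"], env=env)
```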


@gome-pc commented on GitHub (Jul 8, 2025):

Why does setting OLLAMA_FLASH_ATTENTION=1 alone reduce output tokens per second by 35%?
Setting OLLAMA_NUM_PARALLEL=1 alone reduces output tokens/s by 20%.
If both are set to 1, output tokens/s drops by 35%.
The slowdown occurs with both larger and smaller models.
Under what conditions do these two parameters actually accelerate inference? Do the models need to support them? (A simple way to measure this from the API is sketched after the hardware list below.)

Hardware:
Motherboard: ASUS X99-E WS
CPU: E5-2696 v4
Memory: DDR4 2333 MHz, 16 GB × 8 = 128 GB
Graphics cards: 2080 Ti 22 GB × 2 (no SLI)
Storage: M.2 PCIe
OS: Windows 10 x64
Ollama: 0.9.4/0.9.5
Model 1: Gemma3_Q8_0:27B, 28 GB (split across both GPUs, using about 15 + 15 = 30 GB of VRAM)
Model 2: Gemma3:1B, 815 MB (on one GPU, using about 1 GB of VRAM)
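One way to answer this reliably is to measure tokens/s straight from Ollama's API rather than through a third-party page; the final response of a non-streaming `/api/generate` call carries the timing counters. A minimal sketch, assuming the `requests` package, a server on the default port, and a model you have pulled (run it once per server configuration):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def measure(model: str, prompt: str) -> None:
    """Report prompt-processing and generation speed for one request."""
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,  # the final JSON then includes the timing counters
    })
    resp.raise_for_status()
    data = resp.json()

    # Durations are reported in nanoseconds.
    pp_d = data.get("prompt_eval_duration", 0)
    tg_d = data.get("eval_duration", 0)
    if pp_d:
        print(f"prompt: {data['prompt_eval_count'] / (pp_d / 1e9):.1f} tk/s")
    if tg_d:
        print(f"generation: {data['eval_count'] / (tg_d / 1e9):.1f} tk/s")

# Model tag from this thread; substitute whatever `ollama list` shows.
measure("gemma3:1b", "Write a short paragraph about GPUs.")
```

Restart the server with each flag combination (e.g. OLLAMA_FLASH_ATTENTION on and off) between runs, so each measurement reflects exactly one configuration.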


@chigkim commented on GitHub (Jul 9, 2025):

Setting OLLAMA_NUM_PARALLEL increases the total context allocation. For example, with a 4096-token context and parallel set to 2, Ollama allocates 8192 tokens (4096 × 2). Check `ollama ps` and make sure the model is 100% loaded on the GPU (see the sketch after the table); otherwise it will slow down dramatically.
Here's my test with and without flash attention using gemma3:27b-it-q8_0.

| Flash attention | Total Time | Load Time | Prompt Tokens | PP Speed | Generated Tokens | TG Speed |
| --- | --- | --- | --- | --- | --- | --- |
| Yes | 246.74 seconds | 6.18 seconds | 32564 | 174.39 tk/s | 567 | 10.56 tk/s |
| No | 267.36 seconds | 6.17 seconds | 32564 | 165.87 tk/s | 539 | 8.33 tk/s |

I'm on an M3 Max.
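If you'd rather verify GPU residency programmatically than eyeball `ollama ps`, the server exposes the same data over HTTP via `/api/ps`. A minimal sketch, again assuming the `requests` package and the default port:

```python
import requests

# /api/ps lists currently loaded models with their memory placement.
resp = requests.get("http://localhost:11434/api/ps")
resp.raise_for_status()

for m in resp.json().get("models", []):
    size = m["size"]       # total bytes the loaded model occupies
    vram = m["size_vram"]  # bytes resident in GPU memory
    pct = 100 * vram / size if size else 0
    # Anything under 100% means part of the model spilled to CPU RAM,
    # which is where the dramatic slowdown comes from.
    print(f"{m['name']}: {pct:.0f}% on GPU")
```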


@gome-pc commented on GitHub (Jul 10, 2025):

[LLM speed test本地大模型推理速度测试工具v1.4.zip](https://github.com/user-attachments/files/21157401/LLM.speed.test.v1.4.zip)
from: https://www.bilibili.com/opus/1078272739661316119

I tested Gemma3_Q8_0:27B and gemma3_Q4_K_M:4b using the Ollama API of the HTML tool above.
My previous statement was incorrect: it was not OLLAMA_FLASH_ATTENTION=1 and **OLLAMA_NUM_PARALLEL=1**,
but rather OLLAMA_FLASH_ATTENTION=1 and **OLLAMA_SCHED_SPREAD=1** that had a significant impact on my Ollama inference.
The effect appeared consistently when testing with this HTML page, and I'm not sure whether it is caused by my hardware, my software, or the page itself.

Test results:

![Test results](https://github.com/user-attachments/assets/ebf778ba-2dc1-4bf1-a705-11848e57cbdd)

Next I'll test in open-webui and see how it goes.


@gome-pc commented on GitHub (Jul 10, 2025):

Maybe my hardware is too old.

Reference: github-starred/ollama#69520