[GH-ISSUE #4051] Enable Flash Attention on GGML/GGUF (feature now merged into llama.cpp) #49025

Closed
opened 2026-04-28 10:36:51 -05:00 by GiteaMirror · 21 comments

Originally created by @sammcj on GitHub (Apr 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4051

Flash Attention has landed in llama.cpp (https://github.com/ggerganov/llama.cpp/pull/5021).

The TL;DR is simply to pass the `-fa` flag to llama.cpp's server.

  • Can we please have an Ollama server env var to pass this flag to the underlying llama.cpp server?

Also, a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that hints like this can be easily enabled? (E.g. `OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else`.)
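
Purely as an illustration of that proposal (nothing below exists in Ollama today; the variable name and the comma splitting simply mirror the suggestion above), the Ollama side could look roughly like this:

```go
// Illustrative sketch only: append user-supplied extra flags to the
// llama.cpp server command line before the subprocess is started.
package main

import (
	"fmt"
	"os"
	"strings"
)

// buildServerArgs stands in for wherever Ollama assembles the llama.cpp
// server command line; only the extra-args handling is sketched here.
func buildServerArgs(baseArgs []string) []string {
	args := append([]string{}, baseArgs...)
	if extra := os.Getenv("OLLAMA_LLAMA_EXTRA_ARGS"); extra != "" {
		// e.g. OLLAMA_LLAMA_EXTRA_ARGS="-fa" or "-fa,--some-other-flag"
		args = append(args, strings.Split(extra, ",")...)
	}
	return args
}

func main() {
	fmt.Println(buildServerArgs([]string{"--model", "model.gguf", "--ctx-size", "2048"}))
}
```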

GiteaMirror added the feature request label 2026-04-28 10:36:51 -05:00

@DuckyBlender commented on GitHub (Apr 30, 2024):

Are there any cons to using Flash Attention? If not, it should probably be the default.


@DuckyBlender commented on GitHub (Apr 30, 2024):

Also, if there is more memory available, we should really set the default quant to Q4_K_M and fall back to Q4 when Q4_K_M is not available.


@jukofyork commented on GitHub (May 1, 2024):

> Also, a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that hints like this can be easily enabled? (E.g. `OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else`.)

Yeah, I think this would help a lot.


@jukofyork commented on GitHub (May 2, 2024):

> Flash Attention has landed in llama.cpp (ggerganov/llama.cpp#5021).
>
> The TL;DR is simply to pass the `-fa` flag to llama.cpp's server.
>
> * Can we please have an Ollama server env var to pass this flag to the underlying llama.cpp server?
>
> Also, a related idea: perhaps there could be a way to pass arbitrary flags down to llama.cpp so that hints like this can be easily enabled? (E.g. `OLLAMA_LLAMA_EXTRA_ARGS=-fa,--something-else`.)

Just been looking at the source to see if this can be done:

https://github.com/ollama/ollama/blob/main/llm/ext_server/server.cpp

Even though it's importing the latest `gpt_params` structure (from the llama.cpp submodule dated 2 days ago), all the parsing is done by a `server_params_parse` function stripped from an outdated version of the llama.cpp server.

Some of the default values (e.g. `--repeat-penalty`) and even meanings (e.g. `--batch`) are also behind because of this.

I guess we could try to make a diff to see if this can be updated, but since PRs never seem to get accepted and random stuff constantly seems to change regarding the passing of command-line options, I'm reluctant to even try...


@sammcj commented on GitHub (May 2, 2024):

I had a hot crack at a PR tonight, but as you said @jukofyork, it seems server.cpp has diverged from llama.cpp a lot and I couldn't get my head around it.

Might have to wait for someone more familiar with Ollama's version of it / more brains than me.


@sammcj commented on GitHub (May 3, 2024):

FYI: LM Studio added Flash Attention earlier today: https://www.reddit.com/r/LocalLLaMA/comments/1cir98j/lm_studio_released_new_version_with_flash/


@sammcj commented on GitHub (May 3, 2024):

Comparing GGUF performance with/without Flash Attention

## Hardware

- Apple MacBook Pro M2 Max (96GB)

## Model

- llama3-bartowski-8b-instruct-q8-0.gguf

## Request

```
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "registry.internal/ollama/llama3-bartowski:8b-instruct-q8_0",
        "messages": [
            {
                "role": "user",
                "content": "Tell me two short jokes."
            }
        ]
    }'
```

## LM Studio

With Flash Attention

```
Start time: 14:10:20.401
End time: 14:10:21.636
"prompt_tokens": 38,
"completion_tokens": 37,
"total_tokens": 75
```

- ~1.235 seconds
- **~60.65 tokens/s**

## Ollama

Without Flash Attention

```
total duration:       1.866343375s
load duration:        1.299853709s
prompt eval count:    15 token(s)
prompt eval duration: 84.22ms
prompt eval rate:     178.10 tokens/s
eval count:           18 token(s)
eval duration:        481.335ms
eval rate:            37.40 tokens/s

Start TIMESTAMP: 1714709035
End TIMESTAMP: 1714709036
```

- 1.866 seconds
- **40.38 tokens/s**
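
For reference, the quoted rates line up with the raw counters above: total tokens over wall-clock time for the LM Studio run, and eval tokens over eval duration for Ollama's eval rate.

$$
\frac{75\ \text{tokens}}{1.235\ \text{s}} \approx 60.7\ \text{tokens/s},
\qquad
\frac{18\ \text{tokens}}{0.481\ \text{s}} \approx 37.4\ \text{tokens/s}.
$$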

@wanderingmeow commented on GitHub (May 3, 2024):

Got it working by adding the `OLLAMA_LLAMA_EXTRA_ARGS` environment variable in `llm/server.go` as @sammcj suggested. This allows the `-fa` flag to be passed into `ext_server/server.cpp`. If anyone is interested, my hack can be found [here](https://github.com/wanderingmeow/ollama). Hopefully this can help others get Flash Attention working too.


@sammcj commented on GitHub (May 3, 2024):

That looks VERY similar to what I tried, @wanderingmeow; perhaps I just wasn't parsing the parameters correctly 🤔. That's awesome!


@sammcj commented on GitHub (May 3, 2024):

That worked instantly. Still nowhere near as fast as LM Studio, but it's a start.

```
export OLLAMA_LLAMA_EXTRA_ARGS="-fa"

ollama serve

...
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
...
```

```
ollama run registry.internal/ollama/llama3-bartowski:8b-instruct-q8_0 'tell me two short jokes' --verbose
Here are two short jokes:

1. Why don't scientists trust atoms? Because they make up everything!
2. Why don't eggs tell jokes? They'd crack each other up!

Hope you find them amusing!

total duration:       2.539137625s
load duration:        1.286854542s
prompt eval count:    15 token(s)
prompt eval duration: 83.363ms
prompt eval rate:     179.94 tokens/s
eval count:           44 token(s)
eval duration:        1.168093s
eval rate:            37.67 tokens/s
```

@sammcj commented on GitHub (May 3, 2024):

@wanderingmeow I've created a PR with the changes - https://github.com/ollama/ollama/pull/4120


@jukofyork commented on GitHub (May 3, 2024):

Awesome! 👍


@jukofyork commented on GitHub (May 3, 2024):

I wonder if we can extend this:

```
	if other_args := os.Getenv("OLLAMA_LLAMA_EXTRA_ARGS"); other_args != "" {
		params = append(params, strings.Split(other_args, ",")...)
	}
```

to use a regex to match against the model name passed in the `model` string?

Then we could just completely bypass the modelfile code and pass model-specific parameters as well. This would save all the hassle of PRs not getting accepted and then breaking because the code all got moved around!

It looks like we could just copy in `server_params_parse` from the submodule (possibly automatically, using AWK or something) and we would have access to any parameter added to llama.cpp.

It probably doesn't want to be split using "," though, as that is used by llama.cpp for some options like `tensor-split`, etc.


@jukofyork commented on GitHub (May 3, 2024):

Something like:

```
export OLLAMA_LLAMA_EXTRA_ARGS=".* ; -fa | qwen.* ; -sm row ; -ts 3,2 | llama2.* ; --rope-freq-base 8192"
```

Although I'm not sure if the `model` string will actually be the name or the weird SHA hash name that points to the GGUF file?
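
As a rough sketch of how a value in that shape could be interpreted (purely illustrative; the separators and the `extraArgsFor` helper are made up for this example, not anything Ollama implements): entries are split on `|`, the first `;`-separated field is a regex tried against the model name, and the remaining fields are appended as flags when it matches.

```go
// Hypothetical parser for the pattern-keyed extra-args format sketched above:
//   "<model-regex> ; <flag> ; <flag> | <model-regex> ; <flag> ..."
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extraArgsFor returns the flags whose model-name pattern matches model.
func extraArgsFor(spec, model string) ([]string, error) {
	var args []string
	for _, entry := range strings.Split(spec, "|") {
		fields := strings.Split(entry, ";")
		if len(fields) < 2 {
			continue
		}
		re, err := regexp.Compile(strings.TrimSpace(fields[0]))
		if err != nil {
			return nil, err
		}
		if !re.MatchString(model) {
			continue
		}
		for _, f := range fields[1:] {
			// A flag like "-ts 3,2" still has to become two argv entries.
			args = append(args, strings.Fields(f)...)
		}
	}
	return args, nil
}

func main() {
	spec := `.* ; -fa | qwen.* ; -sm row ; -ts 3,2 | llama2.* ; --rope-freq-base 8192`
	args, _ := extraArgsFor(spec, "qwen:72b")
	fmt.Println(args) // [-fa -sm row -ts 3,2]
}
```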


@wanderingmeow commented on GitHub (May 3, 2024):

> Still nowhere near as fast as LM Studio, but it's a start.

From my testing, the current Ollama `ext_server` implementation performs on par with the latest `server` example from llama.cpp, with only a slight slowdown (<1%).


@sammcj commented on GitHub (May 3, 2024):

> It looks like we could just copy in `server_params_parse` from the submodule (possibly automatically, using AWK or something) and we would have access to any parameter added to llama.cpp.

@jukofyork I had a look at this, but it was getting messy, fast.

What I think could probably work is:

  1. Split out Ollama's custom server configuration from the model server parameters.
  2. Do the same in llama.cpp via a PR (if @ggerganov thinks this might be a good idea).
  3. Then Ollama, or any project that wants to use llama.cpp's model server parameters as a library, can do so separately from its own server configuration logic.

That's way out of my ballpark though 😅


@jukofyork commented on GitHub (May 4, 2024):

> > Still nowhere near as fast as LM Studio, but it's a start.
>
> From my testing, the current Ollama `ext_server` implementation performs on par with the latest `server` example from llama.cpp, with only a slight slowdown (<1%).

It really helps reduce the VRAM use of long-context models: some I could only run at 16k or 32k are now running at 32k or 64k for the same quant!
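
Roughly speaking, this is a general property of flash attention rather than anything specific to the llama.cpp implementation: without it, the attention scores for a batch of $b$ query tokens against $n_{\text{ctx}}$ cached tokens are materialized per head, so the scratch buffers grow with the context length, whereas flash attention streams over the context in fixed-size tiles.

$$
\text{scratch}_{\text{no FA}}\ \propto\ n_{\text{head}}\cdot b\cdot n_{\text{ctx}},
\qquad
\text{scratch}_{\text{FA}}\ \propto\ n_{\text{head}}\cdot b\cdot d_{\text{head}}.
$$

The KV cache itself still grows linearly with context either way; it's the score/softmax buffers that stop scaling with $n_{\text{ctx}}$, which is why much larger contexts now fit in the same VRAM.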


@jukofyork commented on GitHub (May 4, 2024):

> > It looks like we could just copy in `server_params_parse` from the submodule (possibly automatically, using AWK or something) and we would have access to any parameter added to llama.cpp.
>
> @jukofyork I had a look at this, but it was getting messy, fast.
>
> What I think could probably work is:
>
> 1. Split out Ollama's custom server configuration from the model server parameters.
> 2. Do the same in llama.cpp via a PR (if @ggerganov thinks this might be a good idea).
> 3. Then Ollama, or any project that wants to use llama.cpp's model server parameters as a library, can do so separately from its own server configuration logic.
>
> That's way out of my ballpark though 😅

Yeah, but it would be good for the long run: some of the parameters and their default settings that Ollama is using are very out of date compared to llama.cpp's server now, and I can only see things getting worse and worse if nothing is done about it.


@sammcj commented on GitHub (May 4, 2024):

Anyway, https://github.com/ollama/ollama/pull/4120 provides the functionality for now; just waiting on someone to approve it...


@sammcj commented on GitHub (May 11, 2024):

#4120 is still sitting there waiting for approval to be merged; I've been trying to keep it up to date and fix conflicts, etc., as other PRs are merged in.


@enzomich commented on GitHub (Jul 18, 2024):

FA is also required for another important feature already present in llama.cpp and hopefully coming to Ollama: KV cache quantization, via the two options `-ctk` and `-ctv`. In particular, it's required for the V cache:

`llama_new_context_with_model: V cache quantization requires flash_attn`
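
To put rough numbers on why cache quantization is attractive (illustrative only, assuming Llama-3-8B-like attention dimensions of 32 layers and 8 KV heads of size 128; these are not figures from this thread):

$$
\underbrace{2}_{K,\,V}\times\underbrace{32}_{\text{layers}}\times\underbrace{8\times 128}_{\text{KV heads}\,\times\,d_{\text{head}}}=65{,}536\ \text{values per cached token}
$$

At f16 that is 128 KiB per token, so an 8192-token context needs about 1 GiB of KV cache; a roughly 4.5-bit quantized cache (q4_0-style for both K and V) brings the same context down to around 0.3 GiB.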
