[GH-ISSUE #2379] The qwen:72b-chat-v1.5 model (and likely all the other v1.5 models too) is missing the rope_frequency_base value in the GGUF file. #1380

Closed
opened 2026-04-12 11:12:40 -05:00 by GiteaMirror · 11 comments

Originally created by @jukofyork on GitHub (Feb 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2379

Originally assigned to: @bmizerany on GitHub.

I've patched my Ollama to allow setting `rope_frequency_base` in the Modelfile again, so I can fix this via:

```
PARAMETER rope_frequency_base 1000000
```

but it should also be possible to use `gguf-set-metadata` to do the same.
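For anyone who wants the same workaround with a patched build, a minimal Modelfile sketch (the base model tag and output name here are illustrative):

```
FROM qwen:72b-chat-v1.5
PARAMETER rope_frequency_base 1000000
```

built with something like `ollama create qwen-72b-fixed -f Modelfile`.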

I'm not the only one who noticed this, as the official GGUF `q5_k_m` and `q2_k` models are also missing the `rope_frequency_base` value:

https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GGUF/discussions/1

> The transformers repo suggested that this model has a ROPE frequency of 1,000,000 while the gguf metadata here has a frequency of 10,000.

I can confirm this does seem to work: without this setting the model just ends up outputting repeating newlines after a while - possibly because the default is 10,000 and that makes the context 'appear' to fill up 100x quicker to the model.
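To make the "100x quicker" intuition concrete: standard RoPE derives per-pair rotation angles theta_i = base^(-2i/d) from the base frequency, so the longest positional wavelength grows roughly linearly with the base. A quick sketch, not from the issue (head_dim = 128 follows from embedding_length 8192 / head_count 64 in the GGUF dump further down):

```
# Rough check of how rope_freq_base sets RoPE's positional wavelengths.
import math

def max_wavelength(base: float, head_dim: int = 128) -> float:
    # Slowest-rotating pair is i = d/2 - 1; its wavelength is 2*pi / theta_i.
    i = head_dim // 2 - 1
    theta = base ** (-2 * i / head_dim)
    return 2 * math.pi / theta

for base in (10_000.0, 1_000_000.0):
    print(f"base={base:>11,.0f}: longest wavelength ~ {max_wavelength(base):>12,.0f} positions")
```

With base 10,000 the slowest pair wraps after roughly 54k positions; with 1,000,000 it's around 5M, so running a model trained at 1,000,000 with the 10,000 default compresses its positional geometry by roughly 100x - consistent with the context 'appearing' to fill up much sooner.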

GiteaMirror added the bug and model labels 2026-04-12 11:12:40 -05:00

@jmorganca commented on GitHub (Feb 7, 2024):

Thanks for catching this and sorry - will update these.


@svilupp commented on GitHub (Feb 7, 2024):

I have been wondering why our LLM Leaderboard scores Qwen models as complete trash ([link](https://svilupp.github.io/Julia-LLM-Leaderboard/dev/examples/summarize_results_local/#Model-Comparison))!
This would explain a lot.

However, I've tried changing the rope freq as an API parameter and running a slice of the benchmark and it made no difference.


@jukofyork commented on GitHub (Feb 7, 2024):

> However, I've tried changing the rope freq as an API parameter and running a slice of the benchmark and it made no difference.

The rope scale and frequency parameters aren't passed through to the wrapped llama.cpp server in the main Ollama branch - they get zeroed out to 0.0f and ignored.

It's only around 6 lines of code to change in 3 files and I will put up a PR later if I get time.


@jukofyork commented on GitHub (Feb 7, 2024):

> > However, I've tried changing the rope freq as an API parameter and running a slice of the benchmark and it made no difference.
>
> The rope scale and frequency parameters aren't passed through to the wrapped llama.cpp server in the main Ollama branch - they get zeroed out to 0.0f and ignored.
>
> It's only around 6 lines of code to change in 3 files and I will put up a PR later if I get time.

It's here: https://github.com/ollama/ollama/pull/2389 but I can't seem to make a second fork of Ollama, and this also includes the code for the PR that allows `split_mode` and `tensor_split` to be set from the modelfile (I'm too dumb to work out how to split off just the changes for the `rope_freq_base` and `rope_freq_scale` - sorry).

These are the 6 lines of code that need to be changed if you just want to clone a copy and recompile:

```
llm/dyn_ext_server.go
=====================
// Pass the user-supplied options through to the llama.cpp server params:
sparams.rope_freq_base = C.float(opts.RopeFrequencyBase)
sparams.rope_freq_scale = C.float(opts.RopeFrequencyScale)

llm/llm.go
==========
// Comment out the lines that hard-coded both options to zero:
// opts.RopeFrequencyBase = 0.0
// opts.RopeFrequencyScale = 0.0

api/types.go
============
// Default both to 0.0 so llama.cpp falls back to the values baked into the GGUF:
RopeFrequencyBase: 0.0,
RopeFrequencyScale: 0.0,
```
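With these changes compiled in, the value should also flow through as a request option; a sketch (the snake_case option name is an assumption derived from the `RopeFrequencyBase` field in api/types.go, so double-check it against your build):

```
curl http://localhost:11434/api/generate -d '{
  "model": "qwen:72b-chat-v1.5",
  "prompt": "Why is the sky blue?",
  "options": { "rope_frequency_base": 1000000 }
}'
```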

@jukofyork commented on GitHub (Feb 7, 2024):

Sadly you can't use `gguf-set-metadata`, as it seems the setting is completely missing from the GGUF file header:

```
> gguf-set-metadata --dry-run qwen-72b-chat.gguf llama.rope.freq_base 1000000
* Loading: qwen-72b-chat.gguf
! Field 'llama.rope.freq_base' not found
```

```
> gguf-dump qwen-72b-chat.gguf
* Dumping 23 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 963
      3: UINT64     |        1 | GGUF.kv_count = 20
      4: STRING     |        1 | general.architecture = 'qwen2'
      5: STRING     |        1 | general.name = 'Qwen2-beta-72B-Chat'
      6: UINT32     |        1 | qwen2.block_count = 80
      7: UINT32     |        1 | qwen2.context_length = 32768
      8: UINT32     |        1 | qwen2.embedding_length = 8192
      9: UINT32     |        1 | qwen2.feed_forward_length = 24576
     10: UINT32     |        1 | qwen2.attention.head_count = 64
     11: UINT32     |        1 | qwen2.attention.head_count_kv = 64
     12: FLOAT32    |        1 | qwen2.attention.layer_norm_rms_epsilon = 9.999999974752427e-07
     13: BOOL       |        1 | qwen2.use_parallel_residual = True
     14: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     15: [STRING]   |   152064 | tokenizer.ggml.tokens
     16: [INT32]    |   152064 | tokenizer.ggml.token_type
     17: [STRING]   |   151387 | tokenizer.ggml.merges
     18: UINT32     |        1 | tokenizer.ggml.eos_token_id = 151643
     19: UINT32     |        1 | tokenizer.ggml.padding_token_id = 151643
     20: UINT32     |        1 | tokenizer.ggml.bos_token_id = 151643
     21: STRING     |        1 | tokenizer.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['rol"
     22: UINT32     |        1 | general.quantization_version = 2
     23: UINT32     |        1 | general.file_type = 7
```

So for now the only alternative is to patch the source and pass `rope_frequency_base = 1000000` via the modelfile, as shown above.
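For anyone wanting to verify their own file first, a minimal sketch using the `gguf` Python package from llama.cpp's gguf-py (the same package that provides `gguf-dump`; the file path is illustrative):

```
# Check whether any *.rope.freq_base key exists in a GGUF file's metadata.
from gguf import GGUFReader

reader = GGUFReader("qwen-72b-chat.gguf")
rope_keys = [name for name in reader.fields if name.endswith("rope.freq_base")]
print(rope_keys if rope_keys else "no rope.freq_base key found")
```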


@svilupp commented on GitHub (Feb 7, 2024):

I think I'll treat Qwen as a write-off or tell people to just use a different backend than Ollama. I wonder how many models are secretly affected by similar "bugs" :-/ (especially when a model performs suspiciously badly in our benchmarks)


@jukofyork commented on GitHub (Feb 8, 2024):

I think every back-end will be affected until a proper GGUF gets uploaded: it seems to be Qwen themselves who accidentally omitted the `rope.freq_base` parameter :/


@jukofyork commented on GitHub (Feb 16, 2024):

They've fixed the official GGUF quants now:

https://twitter.com/justinlin610/status/1757811183707681197?s=46&t=BVhfPLwVzzqRJOcJ7VU3tw

I was finding that the one downloaded from ollama.ai had some other strange problem where it would sometimes pause for around 10-15 seconds and then start outputting newlines (tried both the q8_0 and q5_K_M). No other model has ever done this, so I'm not sure if there is more wrong than just the RoPE base frequency - will report back if the new/fixed official GGUF works any better.


@svilupp commented on GitHub (Feb 16, 2024):

@jmorganca Apologies for the shout-out, but would it be possible to consider re-uploading Qwen? It’s “allegedly” one of the best local models out there, but we can’t use it in Ollama 😓


@jukofyork commented on GitHub (Feb 16, 2024):

I just downloaded the official q8_0 from Qwen's Hugging Face repo and can confirm the weird stalling is fixed and the GGUF has the correct RoPE base frequency baked in.

I've never had any other models stall like that in Ollama, so it's possible the one on ollama.ai is corrupted somehow and it's not just the wrong RoPE setting.


@dhiltgen commented on GitHub (Jul 24, 2024):

This should be part of the GGUF and model download now. Just make sure to re-pull if you have an old copy of the model.
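For example (tag taken from the issue title):

```
ollama pull qwen:72b-chat-v1.5
```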
