[GH-ISSUE #3759] llama3-instruct models not stopping at stop token #64356

Closed
opened 2026-05-03 17:15:16 -05:00 by GiteaMirror · 47 comments

Originally created by @moyix on GitHub (Apr 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3759

What is the issue?

I'm using llama3:70b through the OpenAI-compatible endpoint. When generating, I am getting outputs like this:

Please provide the output of the above command.                                                              
                                                                                                             
Let's proceed from                                                                                           
here!<|eot_id|><|start_header_id|>assistant<|end_header_id|>                                                 
                                                                                                             
It seems that I made a mistake. Radare2 does not have a command called                                       
radebol. Instead, we can use r2 to analyze the binary.                   

Here's the correct command:                                              

This is probably related to https://github.com/vllm-project/vllm/issues/4180. There is also an issue/PR on the LLaMA 3 HuggingFace repo: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/4

But it's a bit confusing since <|eot_id|> is already included in the stop sequences:

$ ollama show --modelfile llama3:70b
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama3:70b

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fe022a8902336d3c452c88f7aca5590f5b5b02ccfd06320fdefab02412e1f0b
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

Is there some other config param that needs to be updated?

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.32

GiteaMirror added the bug label 2026-05-03 17:15:16 -05:00

@binaryc0de commented on GitHub (Apr 19, 2024):

Noticing the same behavior here. When using the langchain package with Ollama, the model often doesn't stop generating once prompted.


@JasonXiao89 commented on GitHub (Apr 19, 2024):

Same issue using llama3:latest 71a106a91016


@olinorwell commented on GitHub (Apr 20, 2024):

I had the same issue and got around it by adding the stop token to the request that the front-end I'm using (LibreChat) makes to Ollama's OpenAI-compatible API endpoint.

I'm sure a more permanent solution will arrive, but for now that does the trick.

(Note: the elephant in the room, of course, is that the stop token is already in the Modelfile as shown above, but that setting appears to be ignored when using the OpenAI-compatible endpoint. Perhaps that endpoint is limited to OpenAI's traditional stop tokens and needs the workaround above to get around the limitation.)
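
For reference, a minimal sketch of that workaround as a raw request against the OpenAI-compatible endpoint (model name and prompt are illustrative; the stop field is part of the standard OpenAI chat-completions schema):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3:70b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stop": ["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"]
      }'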


@taozhiyuai commented on GitHub (Apr 20, 2024):

My Modelfile works fine.

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama3:8b-instruct-fp16

FROM /Users/taozhiyu/.ollama/models/blobs/sha256-a4bbea838ebde985f2f99d710c849219979b9608e44e1c3c46416b5fbff72d64
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop ""<|reserved_special_token""


@telehan commented on GitHub (Apr 20, 2024):

Created from the 70B Q4 GGUF; it's the same problem with ollama run.

This GGUF was downloaded from https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF

$ ollama -v
ollama version is 0.1.32

$ ollama show --modelfile llama3:70b-ins-q4km
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama3:70b-ins-q4km

FROM ~/.ollama/models/blobs/sha256-d559de8dd806a76dbd29f8d8bd04666f2b29e7c7872d8e8481abd07805884d72
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER num_ctx 4096
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

$ ollama run llama3:70b-ins-q4km
>>> hi
Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?assistant

(I'm here to listen and help if I can!)assistant

How's your day going so far?assistant

(By the way, (this emoji means "high five"! 😊)assistant

Would you like to talk about something in particular or just have a casual conversation? I'm all ears (or rather, all tex
😄)!assistant

Haha,^C

$ tail -f ~/.ollama/logs/server3.log
...
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = hub
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                          llama.block_count u32              = 80
...
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 70.55 B
llm_load_print_meta: model size       = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name     = hub
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.55 MiB
...

@taozhiyuai commented on GitHub (Apr 20, 2024):

> Created from the 70B Q4 GGUF; it's the same problem with ollama run.
> [... full Modelfile and server log quoted from @telehan's comment above ...]

Try my Modelfile above. Your file is wrong; it was probably imported from a bad GGUF.


@Madd0g commented on GitHub (Apr 20, 2024):

Happens to me too, on macOS using 0.1.32 and Meta-Llama-3-8B-Instruct-Q6_K.gguf.

When I add assistant\n, <|eot_id|> to the stop tokens, it seems to work at first, but then it begins stopping in the middle of sentences. I upgraded Ollama just to see if it fixes the problem, then removed the stop parameters from the client, and now I see it spamming <|eot_id|> in the middle of a sentence (around 30 of them in a row before stopping).


@telehan commented on GitHub (Apr 20, 2024):

This GGUF version works fine; try it: https://huggingface.co/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/tree/main

$ ollama show --modelfile llama3:70b-ins-q4km
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama3:70b-ins-q4km

FROM ~/.ollama/models/blobs/sha256-123d5e8431dd528075ccf8e026248b84279db190d7ab744be3c69b256003c929
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER num_ctx 4096
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

$ ollama run llama3:70b-ins-q4km
>>> hi
Hi! It's nice to meet you. Is there something I can help you with or would you like
to chat?

$ tail -f ~/.ollama/logs/server3.log
...
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 70.55 B
llm_load_print_meta: model size       = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
...

The previous 70B GGUF has the problem.


@FutureGadget commented on GitHub (Apr 21, 2024):

Solved this manually by adding the stop parameter, but I think this is a bug.

from langchain_community.llms import Ollama  # langchain.llms.Ollama in older langchain versions

llm = Ollama(model="llama3", stop=["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"])
llm.invoke("Why is the sky blue?")
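
The same override can also be sent through Ollama's native API, which accepts stop sequences in the options object; a sketch (prompt illustrative):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {"stop": ["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "<|reserved_special_token"]}
}'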

@leotam commented on GitHub (Apr 21, 2024):

> This GGUF version works fine; try it: https://huggingface.co/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/tree/main
> [... full Modelfile and server log quoted from @telehan's comment above ...]

Only difference from the 70b-instruct is:

PARAMETER num_ctx 4096

@jukofyork commented on GitHub (Apr 21, 2024):

https://github.com/ggerganov/llama.cpp/issues/6772

I edited my GGUF to use the <|eot_id|> token, but it still prints it out. Using gguf-dump I can confirm I have made the change from the Reddit thread, but I don't understand why it still prints <|eot_id|>. I've never had another GGUF model print the EOS token defined in the GGUF header, so I don't get what's special about this one.

So had to also add:

PARAMETER stop "<|eot_id|>" 

But does:

{{ .Response }}<|eot_id|>"""

actually work as expected in Ollama and add <|eot_id|> after the AI's response as required by the wrapped llama.cpp server:

> stop: Specify a JSON array of stopping strings. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration. Default: []

?

We really, desperately need some way to debug stuff like this in Ollama, as it seems model creators are currently competing for the most confusingly complex prompt template possible 😞
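
In the meantime, one way to see what Ollama actually sends (an assumption on my part: the OLLAMA_DEBUG environment variable enables verbose server logging, which should include the fully templated prompt):

# stop any running server first, then relaunch with debug logging
OLLAMA_DEBUG=1 ollama serve
# in another terminal, send a request and watch the server log
ollama run llama3 "Hello"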


@phalexo commented on GitHub (Apr 22, 2024):

Infinite loop here too.

FROM /opt/data/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct.Q5_K_S.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER num_ctx 8192
PARAMETER temperature 0
PARAMETER num_gpu 63

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

SYSTEM """You are an AI programming, planning assistant. You never refuse to answer questions or provide code.
Write a response that appropriately completes the request within the user message."""


>>> Hello.
Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.styleTypeassistant

Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant

Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant

Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant

Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant

Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant

Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant


@jukofyork commented on GitHub (Apr 22, 2024):

> llama.cpp/gguf-py/scripts/gguf-dump.py --no-tensors llama3:70b-instruct-q8_0.gguf

* Loading: llama3:70b-instruct-q8_0.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 24 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 723
      3: UINT64     |        1 | GGUF.kv_count = 21
      4: STRING     |        1 | general.architecture = 'llama'
      5: STRING     |        1 | general.name = 'Meta-Llama-3-70B-Instruct'
      6: UINT32     |        1 | llama.block_count = 80
      7: UINT32     |        1 | llama.context_length = 8192
      8: UINT32     |        1 | llama.embedding_length = 8192
      9: UINT32     |        1 | llama.feed_forward_length = 28672
     10: UINT32     |        1 | llama.attention.head_count = 64
     11: UINT32     |        1 | llama.attention.head_count_kv = 8
     12: FLOAT32    |        1 | llama.rope.freq_base = 500000.0
     13: FLOAT32    |        1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
     14: UINT32     |        1 | general.file_type = 7
     15: UINT32     |        1 | llama.vocab_size = 128256
     16: UINT32     |        1 | llama.rope.dimension_count = 128
     17: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     18: [STRING]   |   128256 | tokenizer.ggml.tokens
     19: [INT32]    |   128256 | tokenizer.ggml.token_type
     20: [STRING]   |   280147 | tokenizer.ggml.merges
     21: UINT32     |        1 | tokenizer.ggml.bos_token_id = 128000
     22: UINT32     |        1 | tokenizer.ggml.eos_token_id = 128009
     23: STRING     |        1 | tokenizer.chat_template = '{% set loop_messages = messages %}{% for message in loop_mes'
     24: UINT32     |        1 | general.quantization_version = 2

Check tokenizer.ggml.eos_token_id = 128009.
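
For anyone whose GGUF still carries eos_token_id = 128001, it can be patched in place; a sketch using llama.cpp's gguf-set-metadata.py helper (script path and flags may vary by checkout, and back the file up first):

python llama.cpp/gguf-py/scripts/gguf-set-metadata.py model.gguf tokenizer.ggml.eos_token_id 128009
# verify the change took:
python llama.cpp/gguf-py/scripts/gguf-dump.py --no-tensors model.gguf | grep eos_token_id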

FROM llama3:70b-instruct-q8_0
TEMPLATE """{{if .System}}<|start_header_id|>system<|end_header_id|>

{{.System}}<|eot_id|>{{end}}<|start_header_id|>user<|end_header_id|>

{{.Prompt}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{.Response}}<|eot_id|>"""
PARAMETER num_ctx 8192
PARAMETER num_gpu 1000
PARAMETER stop "<|eot_id|>"

Seems to be working OK for me:

>>> Hello
Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?

total duration:       3.675769536s
load duration:        1.899593ms
prompt eval count:    11 token(s)
prompt eval duration: 606.671ms
prompt eval rate:     18.13 tokens/s
eval count:           26 token(s)
eval duration:        2.931362s
eval rate:            8.87 tokens/s
>>> What sort of thngs can you help me with?
I'm a large language model, so I can assist with a wide range of topics and tasks. Here are some examples:

1. **Answering questions**: I can provide information on various subjects like history, science, technology, health, and more.
2. **Language translation**: I can translate text from one language to another. I currently support translations in dozens of languages.
3. **Writing and proofreading**: I can help with writing tasks such as suggesting alternative phrases, providing grammar corrections, and even generating text based on a 
prompt.
4. **Conversation and chat**: I can have a conversation with you, answering your questions, sharing interesting facts, or just chatting about your day.
5. **Problem-solving**: I can help with logical reasoning, puzzles, and brain teasers.
6. **Generating ideas**: If you're stuck on a creative project, I can help generate ideas for stories, articles, or other writing tasks.
7. **Learning and education**: I can assist with explaining complex topics, providing study materials, and even offering practice quizzes.
8. **Jokes and humor**: If you need a laugh, I can share some jokes or engage in a fun conversation.
9. **Brainstorming**: I can help facilitate brainstorming sessions for creative projects or business ideas.
10. **Emotional support**: Sometimes, all we need is someone to listen. I'm here to offer a supportive ear and provide words of encouragement.

These are just a few examples of what I can do. If you have something specific in mind, feel free to ask me if I can help!

What's on your mind today?

total duration:       39.675841004s
load duration:        2.299579ms
prompt eval count:    50 token(s)
prompt eval duration: 632.596ms
prompt eval rate:     79.04 tokens/s
eval count:           330 token(s)
eval duration:        38.908067s
eval rate:            8.48 tokens/s
>>> Send a message (/? for help)

@phalexo commented on GitHub (Apr 23, 2024):

OK, this quantized version works after importing into Ollama.

FROM /opt/data/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER num_gpu 73

PARAMETER stop "<|eot_id|>"
PARAMETER stop '<|end_of_text|>'
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop '<|begin_of_text|>'

SYSTEM "You are a helpful AI which can plan, program, and test."


@romkage commented on GitHub (Apr 24, 2024):

I've got looping too. I'm testing with both llama3:8b and llama3:8b-instruct-fp16.
I have tried both models with the Modelfiles mentioned above, but still no luck.

This is with crewai, at the end of the first reply:

I think that's it!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Ha ha, indeed!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

*end of chat*<|eot_id|><|start_header_id|>assistant<|end_header_id|>

*end of chat*<|eot_id|><|start_header_id|>assistant<|end_header_id|>

It seems like we've really wrapped things up this time! *end of chat*<|eot_id|><|start_header_id|>assistant<|end_header_id|>

*end of chat*<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The final nail in the coffin! *end of chat*<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I think we can officially close the book on our conversation now. *end of chat*<|eot_id|><|start_header_id|>assistant<|end_header_id|>

It's over! *end of chat*<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Goodbye!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

and it just goes on.


@richardgroves commented on GitHub (Apr 24, 2024):

I've been round the houses with this as above; I eventually got it working with a stopSequence of ["<|eot_id|>"], which tells the engine to stop asking for more responses when it sees that sequence in the output stream.

Not sure if the Meta Llama 3 card at https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/#special-tokens-used-with-meta-llama-3 is wrong, or Ollama is using the model differently.


@kungfu-eric commented on GitHub (Apr 24, 2024):

> I've been round the houses with this as above; I eventually got it working with a stopSequence of ["<|eot_id|>"], which tells the engine to stop asking for more responses when it sees that sequence in the output stream.
>
> Not sure if the Meta Llama 3 card at https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/#special-tokens-used-with-meta-llama-3 is wrong, or Ollama is using the model differently.

The template was updated yesterday https://ollama.com/library/llama3:70b-instruct. The only change was:

PARAMETER num_keep 24

Was your change different from this line that's always been in the file?:

PARAMETER stop "<|eot_id|>"

@richardgroves commented on GitHub (Apr 24, 2024):

@kungfu-eric I think I had (still have) an older version:

ollama show --modelfile llama3:8b

FROM /Users/richard/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|reserved_special_token"

A newly pulled llama3 (latest) shows:

FROM /Users/richard/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER num_keep 24
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"

I'm getting extra issues as I'm working through modelfusion (https://github.com/vercel/modelfusion): unformatted chat requests to /api/chat with no stop sequences specified work for Llama 2 but not Llama 3. Tracing through the modelfusion code to work out what is going on is sloooow. Quick hacks on the completion API code got Llama 3 working by forcing "<|eot_id|>" as a specified stop sequence.


@richardgroves commented on GitHub (Apr 25, 2024):

Further investigation has found the specific problem, though it's no clearer whether Ollama or modelfusion is at fault.

So with llama3 this works:

curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama3","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}]}'

and so does this:

curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama3","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":["<|eot_id|>"]}}'

but this doesn't:

curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama3","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":[]}}'

But with llama 2 all these work:

curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama2","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}]}'

curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama2","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":[]}}'

curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama2","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":["</s>"]}}'

So the difference is that using Ollama with Llama 2 and specifying a stop option of [] works, but on Llama 3 it doesn't.

Modelfusion's 'chat' paths make it harder to set the stop options and send an empty [], whereas its completion models do allow setting the stop options, which is what I got working in my earlier message.


@reneleonhardt commented on GitHub (Apr 28, 2024):

@richardgroves thank you for analyzing this problem 👍
Would it be possible/feasible to fix this inside Ollama instead of requiring every user/application to specify different stop tokens per model?


@nkeilar commented on GitHub (Apr 28, 2024):

Something is not right with the 70B model, IMHO. I'm using the q4_K_M and q4_0 versions with crewai, and they diverge so much from the cloud version on Groq that it's not possible to use these models with crewai. AFAIK the model should be more or less the same, and I wouldn't have expected such poor performance compared with the same model served by another provider. I wasted days trying to get good results with Ollama, but it just didn't happen; I thought I was going mad, so I tried the cloud version of the same model on Groq and it just works. So either something is wrong, or there is significant model degradation, which I wouldn't expect with a Q4 version.


@jukofyork commented on GitHub (Apr 28, 2024):

> Something is not right with the 70B model IMHO - I'm using the q4_K_M and the q4_0 versions with crewai, and they diverge so much from the cloud version on Groq that it's not possible to use these models with crewai. AFAIK the model should be more or less the same, and I wouldn't have expected such poor performance compared with the same model served by another provider. I wasted days trying to get good results with Ollama, but it just didn't happen; I thought I was going mad, so I tried the cloud version of the same model on Groq and it just works. So either something is wrong, or there is significant model degradation, which I wouldn't expect with a Q4 version.

There is some problem reported with the tokenizer so it could be that (assuming it's not the broken GGUF problem):

https://old.reddit.com/r/LocalLLaMA/comments/1cdmfoz/fyi_theres_some_bpe_tokenizer_issues_in_llamacpp/


@eracle commented on GitHub (Apr 28, 2024):

I fixed it with this, though I am not really sure what I am doing, since I don't know how the Ollama internals work:

```python
# Assuming LangChain's community Ollama wrapper; older installs may import
# Ollama from langchain.llms instead.
from langchain_community.llms import Ollama

llm = Ollama(
    model="llama3",
    stop=["<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>",
          "<|reserved_special_token", "assistant"],
)
```

@nkeilar commented on GitHub (Apr 29, 2024):

> There is some problem reported with the tokenizer so it could be that (assuming it's not the broken GGUF problem):
>
> https://old.reddit.com/r/LocalLLaMA/comments/1cdmfoz/fyi_theres_some_bpe_tokenizer_issues_in_llamacpp/

I just learned about the strange behaviour when exceeding the context length. The change proposed in https://github.com/ollama/ollama/issues/3819 improved things, but when running crewai, Ollama still seems to get stuck in a loop, generating something and never returning, so I have to force-kill the process.


@richardgroves commented on GitHub (Apr 29, 2024):

There is the same problem with the Phi-3 model on Ollama:

`ollama pull phi3`

`curl http://localhost:11434/api/chat -d '{"stream":true,"model":"phi3","messages":[{"role":"user","content":"What is 1+1?"}],"options":{"stop":[]}}'`

Note that it will stop eventually, when an empty content response is received, but a few marker tokens have been sent by then. Whereas:

`curl http://localhost:11434/api/chat -d '{"stream":true,"model":"phi3","messages":[{"role":"user","content":"What is 1+1?"}]}'`

stops without the extra tokens.

It is hard to call this a bug in Ollama, as `"options":{"stop":[]}` is basically requesting that it not stop until an empty response is sent, but it appears that for older models (e.g. mistral / llama2) it has worked to mean "use the modelfile stop parameters".
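
The behaviour described here suggests the eventual fix is a fallback rule on the server side: an empty `stop` array should mean "use the modelfile defaults", not "never stop". A toy Python sketch of that rule, purely illustrative (Ollama itself is written in Go, and its real option-merging code may differ):

```python
def effective_stops(request_stops, modelfile_stops):
    # Illustrative semantics only, not Ollama's actual implementation:
    # an empty or missing list from the client should not wipe out the
    # stop sequences baked into the modelfile.
    if request_stops:  # the client supplied real stop sequences
        return request_stops
    return modelfile_stops  # [] or None falls back to the defaults

assert effective_stops([], ["<|eot_id|>"]) == ["<|eot_id|>"]
assert effective_stops(["###"], ["<|eot_id|>"]) == ["###"]
```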


@danielgen commented on GitHub (Apr 29, 2024):

For me, Llama3 works as expected in the Ollama CLI. However, it does not work in CrewAI, not even when specifying the same modelfile. Not sure if Ollama is at fault here; it might well be a LangChain issue or something else.

Below is the modelfile:

```
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama3:latest

FROM /Users/[omitted]
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER num_keep 24
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
```

@olinorwell commented on GitHub (Apr 29, 2024):

> For me, Llama3 works as expected in the Ollama CLI. However, it does not work in CrewAI, not even when specifying the same modelfile. Not sure if Ollama is at fault here; it might well be a LangChain issue or something else.

Agreed. In my case Llama3 was perfect when using the Ollama CLI. The issues arose when other programs connected to Ollama via the OpenAI-compatible interface.


@phalexo commented on GitHub (Apr 29, 2024):

I have llama3-70b-instruct Q5 working with gptpilot. Appears quite stable.


@richardgroves commented on GitHub (Apr 29, 2024):

@danielgen @olinorwell Are you able to trace the request sent to the Ollama server from those external tools, to see if it is the same `"options":{"stop":[]}` problem I've written about above, or some other issue?
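
One low-tech way to capture what a tool actually sends is to park a tiny logging proxy in front of Ollama and point the client at it. A debugging sketch using only the Python standard library (port 11435 is my arbitrary choice; note it buffers streamed responses, which is fine for inspection):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.request

OLLAMA = "http://localhost:11434"  # the real server; point clients at :11435

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Print the raw request so you can spot e.g. "options":{"stop":[]}.
        print(f"--> POST {self.path}\n{body.decode('utf-8', 'replace')}")
        req = urllib.request.Request(
            OLLAMA + self.path, data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            status, data = resp.status, resp.read()  # buffers streamed replies
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

HTTPServer(("localhost", 11435), LoggingProxy).serve_forever()
```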


@boristopalov commented on GitHub (Apr 30, 2024):

> For me, Llama3 works as expected in the Ollama CLI. However, it does not work in CrewAI, not even when specifying the same modelfile. Not sure if Ollama is at fault here; it might well be a LangChain issue or something else.
>
> Below is the modelfile:
>
> *(same modelfile as posted above)*

It seems like an Ollama issue: I have a program that hits the Ollama API directly (no LangChain or other wrappers) and I was having this issue. Adding `PARAMETER stop "<|eot_id|>"` fixed it for me, but I now see `<|eot_id|>` at the end of each response, which is annoying.
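
Until the root cause is fixed, the stray token can also be trimmed on the client. A small band-aid sketch (the marker list mirrors the modelfile stop parameters above; this hides the symptom rather than fixing the cause):

```python
LLAMA3_MARKERS = ("<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>")

def strip_markers(text: str) -> str:
    # Repeatedly trim trailing llama3 special tokens (plus whitespace)
    # that the server failed to swallow.
    text = text.rstrip()
    while text.endswith(LLAMA3_MARKERS):
        for marker in LLAMA3_MARKERS:
            if text.endswith(marker):
                text = text[: -len(marker)].rstrip()
    return text

assert strip_markers("1+1 is 2.<|eot_id|>") == "1+1 is 2."
```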


@sabaimran commented on GitHub (May 1, 2024):

I'm also experiencing this issue while routing to Ollama via the `openai` chat completions Python library. It streamed 30,000 characters and emitted `<|eot_id|><|start_header_id|><|end_header_id|>` multiple times.

Without knowing too much about the Ollama internals, there may be an issue with the way the prompt template is formatted in the requests. And as others have pointed out, stop words are not being honored.
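
For clients on the OpenAI-compatible endpoint, passing the llama3 stop sequences explicitly with each request is a workaround that needs no modelfile changes. A sketch using the official `openai` Python package (the `base_url` and the dummy `api_key` follow the usual pattern for pointing it at a local Ollama; adjust to taste):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "What is 1+1?"}],
    # Mirror the modelfile's stop parameters so generation halts even if
    # the server-side stop handling misbehaves.
    stop=["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"],
)
print(resp.choices[0].message.content)
```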


@97k commented on GitHub (May 1, 2024):

I am also experiencing the same issue. It doesn't happen with the Ollama CLI, but I am not able to use the APIs. It also doesn't happen with Mistral.

I am using `langchain.chat_models.ollama.ChatOllama`.


@nickychung commented on GitHub (May 1, 2024):

Both the Ollama CLI and `ollama.chat` resulted in a never-ending response. Changing the modelfile did not resolve the issue. However, instead of using `ollama pull`, I downloaded the Llama3 GGUF from Hugging Face and ran `ollama create` with the modelfile provided there, which resolved the issue.


@vrijsinghani commented on GitHub (May 1, 2024):

@nickychung can you elaborate and specify the exact model and modelfile contents? I've tried a few of them and so far they all either fail to stop or generate gibberish.


@orangeswim commented on GitHub (May 1, 2024):

Okay, I had a similar issue today; this was the solution for me:

- First, `pip install gguf`.
- See https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf-dump.py — copy and run that Python script to see the metadata for your GGUF model.
- Next, see https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/scripts/gguf-set-metadata.py — run that script to change the eos token. The eos token id for llama3 is 128009; for some reason, the quantized models can have a different token.

After changing it to the correct eos token, the model runs as expected.
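
To check which eos token id a GGUF file actually carries before patching it, something like the following should work (a sketch assuming the `gguf` package's `GGUFReader` API as of mid-2024; field-access details may differ between versions):

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Meta-Llama-3-8B-Instruct.Q4_K_S.gguf")
field = reader.get_field("tokenizer.ggml.eos_token_id")
# For llama3-instruct this should be 128009 (<|eot_id|>); if it reads
# 128001 (<|end_of_text|>) instead, the model never emits the stop token
# the instruct template expects.
print(field.parts[field.data[0]])
```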

@nickychung commented on GitHub (May 2, 2024):

> @nickychung can you elaborate and specify the exact model and modelfile contents? I've tried a few of them and so far they all either fail to stop or generate gibberish.

Model: https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF

Modelfile:

```
FROM D:/my_models/Meta-Llama-3-8B-Instruct.Q4_K_S.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER num_keep 24
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
```

@Sneakr commented on GitHub (May 4, 2024):

Loading the original llama3 instruct works fine, but a new model created from a fine-tuned llama3 still has the continuous, non-stop generation issue.


@phalexo commented on GitHub (May 4, 2024):

Only some quantized models have the issue. I am using a Q5_M model and it terminates. I think some quantized models are not generating the stop token.


@Sneakr commented on GitHub (May 4, 2024):

I fine-tuned my own model; it works fine in inference mode after the tune, and the GGUF conversion works fine in LM Studio, but when loaded into Ollama it has the non-stop generation issue, even with the llama3 prompt template defined.


@VideoFX commented on GitHub (May 4, 2024):

I've had the same issue.

I've tried llama3 and llama3-gradient. I've updated Ollama and the models.

I've tried crewai, langchain, and OpenWebUI; they all behave similarly. I've updated those to the newest versions as well.

The model will run for what seems like forever, and eventually repeats itself in a loop or talks gibberish. It has to be stopped manually.


@97k commented on GitHub (May 7, 2024):

> Both the Ollama CLI and `ollama.chat` resulted in a never-ending response. Changing the modelfile did not resolve the issue. However, instead of using `ollama pull`, I downloaded the Llama3 GGUF from Hugging Face and ran `ollama create` with the modelfile provided there, which resolved the issue.

Thank you! This worked.

For anyone else going through this, here it is step by step:

1. Go to [QuantFactory on HF](https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/tree/main) and choose the quantised model you want.

2. Download the GGUF model; I personally prefer q5_K_M.

3. Once downloaded, create a Modelfile (again, thanks to @nickychung):

   ```
   FROM <downloaded gguf file>

   TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

   {{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

   {{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

   {{ .Response }}<|eot_id|>"""
   PARAMETER num_keep 24
   PARAMETER stop "<|start_header_id|>"
   PARAMETER stop "<|end_header_id|>"
   PARAMETER stop "<|eot_id|>"
   ```

4. Create the model:

   ```
   ollama create llama3:<your tag> -f <path to Modelfile>
   ```

5. `ollama ls` will show you the model.

6. Celebrate!
   ![image](https://github.com/ollama/ollama/assets/21143936/703081f2-8c95-4476-91fe-f6838451812c)
   and it *respects* the EOS!
   ![image](https://github.com/ollama/ollama/assets/21143936/76fd0293-62c4-4492-add5-83c9329a871a)
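
As a quick smoke test that the rebuilt model really stops, one can hit `/api/chat` directly and check for leaked special tokens (a sketch; the `llama3:mytag` name stands in for whatever tag you chose in step 4):

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps({
        "stream": False,
        "model": "llama3:mytag",  # hypothetical tag from `ollama create`
        "messages": [{"role": "user", "content": "What is 1+1?"}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
reply = json.loads(urllib.request.urlopen(req).read())["message"]["content"]
assert "<|eot_id|>" not in reply, "special token leaked into the response"
print(reply)
```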

@richardgroves commented on GitHub (May 10, 2024):

The latest Ollama release appears to have fixed this problem for Llama3 and Phi3.

`ollama --version` now reports `ollama version is 0.1.34`.

`curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama3","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":[]}}'`

now stops properly, as does the same test with `"model":"phi3"`.

So many pull requests have been merged recently that it is hard to pinpoint the exact change that fixed it.


@reneleonhardt commented on GitHub (May 11, 2024):

> Now stops properly, as does the same test with `"model":"phi3"`.
>
> So many pull requests have been merged recently that it is hard to pinpoint the exact change that fixed it.

I'm glad this has finally been fixed!
Yeah, too many merges causing problems, mixed with some people trying to fix them later if you're lucky 😅
I wonder why the test suite doesn't catch special tokens inside instruct-model responses across different prompt templates and endpoints...


@joshuavial commented on GitHub (May 13, 2024):

I had the same problem, but the issue was an out-of-date Ollama client; upgrading sorted things out.


@eevmanu commented on GitHub (May 13, 2024):

For any onlooker: if you're

- using Linux,
- have added the [startup service](https://github.com/ollama/ollama/blob/main/docs/linux.md#adding-ollama-as-a-startup-service-recommended), and
- have updated Ollama recently as described in https://github.com/ollama/ollama/issues/3759#issuecomment-2104445225,

don't forget to restart the service (`sudo systemctl restart ollama.service`). YMMV, but in my case it started throwing memory errors despite the restart instructions at https://github.com/ollama/ollama/blob/9c76b30d72b76f0ce1fe7f357651ea9985c2cb24/docs/linux.md#L51-L52.

@adryan-ai commented on GitHub (May 15, 2024):

Upgrading Ollama from 0.1.32 to 0.1.37 resolved this for me.


@jmorganca commented on GitHub (Jun 25, 2024):

Great! Sorry for the issues, and glad to see the newest versions fixed this for folks. Let me know if that's not the case and I can re-open.

Reference: github-starred/ollama#64356