Mirror of https://github.com/ollama/ollama.git
Closed · 47 comments
Originally created by @moyix on GitHub (Apr 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3759
What is the issue?
I'm using llama3:70b through the OpenAI-compatible endpoint. When generating, I am getting outputs like this:
This is probably related to https://github.com/vllm-project/vllm/issues/4180? There is also an issue/PR on the LLaMA 3 HuggingFace repo: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/4
But it's a bit confusing since <|eot_id|> is already included in the stop sequences. Is there some other config param that needs to be updated?
OS: Linux
GPU: Nvidia
CPU: AMD
Ollama version: 0.1.32
@binaryc0de commented on GitHub (Apr 19, 2024):
Noticing the same behavior here. When using the langchain package with Ollama, the model often doesn't stop generating once prompted.
@JasonXiao89 commented on GitHub (Apr 19, 2024):
Same issue using llama3:latest 71a106a91016
@olinorwell commented on GitHub (Apr 20, 2024):
I had the same issue and got around it by adding the stop token to the request that the front-end I am using (LibreChat) was making to Ollama's OpenAI-compatible API endpoint.
I'm sure a more permanent solution will arrive, but for now that does the trick.
(Note: the elephant in the room, of course, is that the stop token is in the model file as shown above, but that setting appears to be ignored when using the OpenAI-compatible endpoint. Perhaps that is fixed to OpenAI's traditional stop tokens and needs my solution above to get around the limitation.)
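For anyone hitting this through the OpenAI-compatible layer, a minimal sketch of that workaround looks like the request below; the model tag, prompt, and port are just examples, and the stop field is the standard chat-completions parameter rather than anything Ollama-specific.

```bash
# Sketch only: pass the stop token explicitly on Ollama's OpenAI-compatible endpoint.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:70b",
    "messages": [{"role": "user", "content": "What is 1+1?"}],
    "stop": ["<|eot_id|>"]
  }'
```

This only papers over the problem; once the template/EOS handling is fixed, the model's own stop parameters should make the explicit list unnecessary.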
@taozhiyuai commented on GitHub (Apr 20, 2024):
my model file works fine.
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama3:8b-instruct-fp16
FROM /Users/taozhiyu/.ollama/models/blobs/sha256-a4bbea838ebde985f2f99d710c849219979b9608e44e1c3c46416b5fbff72d64
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop ""<|reserved_special_token""
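To try a Modelfile like the one above against a model you have already pulled, one approach (the llama3-fixed tag below is just an example name) is to dump the existing Modelfile, adjust the TEMPLATE and PARAMETER stop lines, and build a new local model from it:

```bash
# Dump the Modelfile of an already-pulled model, edit it, then rebuild locally.
ollama show --modelfile llama3:8b-instruct-fp16 > Modelfile
# ...edit the TEMPLATE / PARAMETER stop lines as needed...
ollama create llama3-fixed -f Modelfile
ollama run llama3-fixed
```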
@telehan commented on GitHub (Apr 20, 2024):
Created from a gguf (70b q4); it's the same problem with ollama run. This gguf was downloaded from https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF
@taozhiyuai commented on GitHub (Apr 20, 2024):
Try my model file. Your file is wrong; it was maybe imported from a gguf.
@Madd0g commented on GitHub (Apr 20, 2024):
Happens to me too, on macOS using 0.1.32 and Meta-Llama-3-8B-Instruct-Q6_K.gguf.
When I add assistant\n, <|eot_id|> to the stop tokens, it seems to work at first, but then it begins stopping in the middle of sentences. I upgraded Ollama just to see if it fixes the problem, removed the stop parameters from the client, and now I see it spamming <|eot_id|> in the middle of the sentence (like 30 of them in a row, and then stopping).
@telehan commented on GitHub (Apr 20, 2024):
this gguf version works fine, try it https://huggingface.co/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/tree/main
the previous gguf 70b has the problem
@FutureGadget commented on GitHub (Apr 21, 2024):
Solved this manually by adding the stop parameter, but I think this is a bug.
@leotam commented on GitHub (Apr 21, 2024):
Only difference from the 70b-instruct is:
@jukofyork commented on GitHub (Apr 21, 2024):
https://github.com/ggerganov/llama.cpp/issues/6772
I edited my gguf to use the <|eot_id|> token but it still prints it out? Using gguf-dump I can confirm I have made the change from the reddit thread, but I don't understand why it prints <|eot_id|>. I've never had another gguf model print the EOS token defined in the gguf header, so I don't get what's special about this one. So I had to also add:
But does:
actually work as expected in Ollama and add <|eot_id|> after the AI's response, as required by the wrapped llama.cpp server?
We desperately need some way to debug stuff like this in Ollama, as it seems model creators are currently competing for the most confusingly complex prompt template possible 😞
@phalexo commented on GitHub (Apr 22, 2024):
Infinite loop here too.
FROM /opt/data/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct.Q5_K_S.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER num_ctx 8192
PARAMETER temperature 0
PARAMETER num_gpu 63
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
SYSTEM """You are an AI programming, planning assistant. You never refuse to answer questions or provide code.
Write a response that appropriately completes the request within the user message."""
Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant
Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant
Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant
Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant
Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant
Hello! It's nice to meet you. Is there something I can help you with, such as a programming problem or a question about a specific topic? Or would you like to
discuss a project idea you have in mind? I'm here to assist you in any way I can.assistant
@jukofyork commented on GitHub (Apr 22, 2024):
Check tokenizer.ggml.eos_token_id = 128009. Seems to be working OK for me:
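To verify that field on a local GGUF before importing it, the gguf-dump tool mentioned earlier in the thread can be used; a rough check (the file path is a placeholder) would be:

```bash
# Dump the GGUF header and look for the EOS token id.
pip install gguf
gguf-dump /path/to/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf | grep -i eos_token_id
# For the Llama 3 instruct models this should be 128009, i.e. <|eot_id|>.
```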
@phalexo commented on GitHub (Apr 23, 2024):
Ok, this quantized version works after ollama import.
FROM /opt/data/QuantFactory/Meta-Llama-3-70B-Instruct-GGUF/Meta-Llama-3-70B-Instruct.Q5_K_M.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
PARAMETER num_gpu 73
PARAMETER stop "<|eot_id|>"
PARAMETER stop '<|end_of_text|>'
PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop '<|begin_of_text|>'
SYSTEM "You are a helpful AI which can plan, program, and test."
@romkage commented on GitHub (Apr 24, 2024):
I've got looping too. I'm testing with both llama3:8b and llama3:8b-instruct-fp16.
I have tried both models with the Modelfiles mentioned above, but still no luck.
This is with crewai, at the end of the first reply:
and it just goes on.
@richardgroves commented on GitHub (Apr 24, 2024):
I've been round the houses with this as above. I eventually got it working with a stopSequence of ["<|eot_id|>"], which tells the engine to stop asking for more responses when it sees that token in the output stream of new data.
Not sure if the Meta Llama 3 card at https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/#special-tokens-used-with-meta-llama-3 is wrong, or Ollama is using the model differently.
@kungfu-eric commented on GitHub (Apr 24, 2024):
The template was updated yesterday https://ollama.com/library/llama3:70b-instruct. The only change was:
Was your change different from this line that's always been in the file?:
@richardgroves commented on GitHub (Apr 24, 2024):
@kungfu-eric I think I had (still have) an older version:
ollama show --modelfile llama3:8b
A newly pulled llama3 (latest) shows:
I'm getting extra issues as I'm working through modelfusion (https://github.com/vercel/modelfusion): unformatted chat requests to /api/chat with no stop sequences specified work for Llama 2 but not Llama 3. Tracing through the modelfusion code to work out what is going on is sloooow. Quick hacks on the completion API code got Llama 3 working by forcing "<|eot_id|>" as a specified stop sequence.
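For reference, forcing the stop sequence on the native completion endpoint, which is roughly what that hack amounts to, looks like the request below; the model tag and prompt are placeholders.

```bash
# Force <|eot_id|> as a stop sequence on Ollama's /api/generate endpoint.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What is 1+1?",
  "options": { "stop": ["<|eot_id|>"] }
}'
```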
@richardgroves commented on GitHub (Apr 25, 2024):
Further investigation has found the specific problem, but it is no clearer whether Ollama or modelfusion is at fault.
So with llama3 this works:
curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama3","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}]}'
and so does this:
curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama3","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":["<|eot_id|>"]}}'
but this doesn't:
curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama3","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":[]}}'
But with llama 2 all of these work:
curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama2","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}]}'
curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama2","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":[]}}'
curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama2","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":["</s>"]}}'
So the difference is that using Ollama with Llama 2 and specifying a stop option of [] works, but on Llama 3 it doesn't.
Modelfusion's 'chat' paths make it less easy to set the stop options, and they send an empty [], whereas the 'completion' models do allow setting of the stop options, which is what I'd got working in my earlier message.
@reneleonhardt commented on GitHub (Apr 28, 2024):
@richardgroves thank you for analyzing this problem 👍
Would it be possible / feasible to fix it inside of ollama instead of requiring every user/application to specify different stop tokens for both models?
@nkeilar commented on GitHub (Apr 28, 2024):
Something is not right with the 70B model IMHO. I'm using the q4_K_M and the q4_0 versions with crewai, and they diverge so much from the cloud version on Groq that it's not possible to use these models with crewai. AFAIK the model should be more or less the same, and I wouldn't have expected such poor performance compared with the same model apparently served by another provider. I wasted days trying to get good results with Ollama, but it just didn't happen; I thought I was going mad, so I tried the cloud version of the same model on Groq and it just works. So either something is wrong, or there is significant model degradation, which I wouldn't expect with a Q4 version.
@jukofyork commented on GitHub (Apr 28, 2024):
There is some problem reported with the tokenizer so it could be that (assuming it's not the broken GGUF problem):
https://old.reddit.com/r/LocalLLaMA/comments/1cdmfoz/fyi_theres_some_bpe_tokenizer_issues_in_llamacpp/
@eracle commented on GitHub (Apr 28, 2024):
I fixed it with this, but I am not really sure what I am doing, since I don't know how the Ollama internals work.
@nkeilar commented on GitHub (Apr 29, 2024):
I just learned about the strange behaviour when exceeding the context length. The change proposed in this thread (https://github.com/ollama/ollama/issues/3819) improved things, but when running crewai, Ollama still seems to get stuck in a loop, generating something but never returning, so I have to force kill the process.
@richardgroves commented on GitHub (Apr 29, 2024):
There is the same problem with the Phi-3 model on Ollama:
ollama pull phi3
curl http://localhost:11434/api/chat -d '{"stream":true,"model":"phi3","messages":[{"role":"user","content":"What is 1+1?"}],"options":{"stop":[]}}'
Note it will stop eventually, when an empty content response is received, but a few marker tokens have been sent too.
Whereas:
curl http://localhost:11434/api/chat -d '{"stream":true,"model":"phi3","messages":[{"role":"user","content":"What is 1+1?"}]}'
stops without the extra tokens.
Hard to say it is a bug in Ollama, as "options":{"stop":[]} is basically requesting it to not stop until an empty response is sent, but it appears that for older models (e.g. mistral / llama2) it has worked to mean 'use the model file stop parameters'.
@danielgen commented on GitHub (Apr 29, 2024):
For me, Llama3 works as expected in the Ollama CLI.
However, it does not work in CrewAI, not even when specifying the same modelfile.
Not sure if Ollama is at fault here; it might well be a langchain issue or something else.
Below is the modelfile:
@olinorwell commented on GitHub (Apr 29, 2024):
Agreed. In my case Llama3 was perfect when using the Ollama CLI. The issues were when other programs connected to Ollama via the OpenAI compatible interface.
@phalexo commented on GitHub (Apr 29, 2024):
I have llama3-70b-instruct Q5 working with gptpilot. Appears quite stable.
@richardgroves commented on GitHub (Apr 29, 2024):
@danielgen @olinorwell Are you able to trace the request sent to the Ollama server from those external tools, to see if it is the same "options":{"stop":[]} problem I've written about above, or some other issue?
@boristopalov commented on GitHub (Apr 30, 2024):
It seems like an Ollama issue: I have a program that hits the Ollama API directly (it doesn't use Langchain or any other wrappers) and I was having this issue. Adding PARAMETER stop "<|eot_id|>" fixed it for me, but I now see the <|eot_id|> token at the end of each response, which is annoying.
@sabaimran commented on GitHub (May 1, 2024):
I'm also experiencing this issue while routing to Ollama via the openai chat completions Python library. It streamed 30000 characters and emitted <|eot_id|><|start_header_id|><|end_header_id|> multiple times.
Without knowing too much about the Ollama internals, there may be an issue in the way the prompt template is being formatted in the requests? And like others have pointed out, stop words are not being honored.
@97k commented on GitHub (May 1, 2024):
I am also experiencing the same issue. It doesn't happen with the Ollama CLI, but I am not able to use the APIs.
This doesn't happen with mistral.
I am using Langchain.chat_models.ollama.ChatOllama
@nickychung commented on GitHub (May 1, 2024):
Both the ollama CLI and ollama.chat resulted in a never-ending response. Changing the modelfile did not resolve the issue. However, instead of using ollama pull, I successfully addressed this problem by downloading the Llama3 GGUF from Hugging Face and running 'ollama create' with the modelfile provided there. This approach ultimately resolved the issue.
@vrijsinghani commented on GitHub (May 1, 2024):
@nickychung can you elaborate and specify the specific model and modelfile contents? I've tried a few of them and they all do not stop or generate gibberish so far.
@orangeswim commented on GitHub (May 1, 2024):
Okay, so I had a similar issue today; this was the solution for me:
pip install gguf
After changing the token to the correct EOS token, the model runs as expected.
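The original script isn't preserved in this mirror. Assuming the gguf package's gguf-set-metadata helper is available in your install, rewriting the header value in place would look roughly like this (the file path is a placeholder, and 128009 is the <|eot_id|> id for the Llama 3 instruct tokenizer):

```bash
# Sketch: point the GGUF header's EOS token id at <|eot_id|>.
pip install gguf
gguf-set-metadata /path/to/Meta-Llama-3-8B-Instruct-Q6_K.gguf tokenizer.ggml.eos_token_id 128009
# If the tool refuses to overwrite an existing value, it may need a --force flag.
```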
@nickychung commented on GitHub (May 2, 2024):
Model: https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
ModelFile:
@Sneakr commented on GitHub (May 4, 2024):
Loading the original llama3 instruct works fine, but creating a new model from a fine-tuned llama3 still has the continuous, non-stop generating issue.
@phalexo commented on GitHub (May 4, 2024):
Only some quantized models have the issue. I am using a Q5_M model and it terminates. I think some quantized models are not generating the stop token.
@Sneakr commented on GitHub (May 4, 2024):
I fine-tuned my own model; it works fine in inference mode after the tune, and converting to GGUF works fine with LM Studio, but when loaded into Ollama it has the non-stop generating issue, even with the prompt template defined for llama3.
@VideoFX commented on GitHub (May 4, 2024):
I've had the same issue.
I've tried llama3 and llama3-gradient. I've updated Ollama and the models.
I've tried crewai, langchain, and openwebUI; they all behave similarly. I've updated those to the newest versions as well.
The model will run for what seems like forever, and eventually repeats itself in a loop or talks gibberish. It has to be stopped manually.
@97k commented on GitHub (May 7, 2024):
Thank you! This worked
For anyone else going through this, I will break it down step by step (see the command sketch below):
1. Go through QuantFactory on HF and choose the quantised model you want.
2. Download the gguf model; I personally prefer q5_K_M.
3. Once downloaded, create a Modelfile (again, thanks to @nickychung).
4. Create the model using ollama create.
5. ollama ls will show you the model.
6. Celebrate!
...and it respects the EOS!
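A rough command-line version of those steps, with a placeholder Modelfile and an arbitrary local model tag:

```bash
# Build a local model from the downloaded GGUF plus the Modelfile above, then verify.
ollama create llama3-8b-instruct-q5 -f Modelfile
ollama ls                                   # the new model should appear here
ollama run llama3-8b-instruct-q5 "What is 1+1?"
```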
@richardgroves commented on GitHub (May 10, 2024):
The latest Ollama release appears to have fixed this problem for Llama3 and Phi3.
My ollama --version now reports: ollama version is 0.1.34
curl http://localhost:11434/api/chat -d '{"stream":true,"model":"llama3","messages":[{"role":"system","content":"You are a helpful, respectful and honest assistant."},{"role":"user","content":"What is 1+1?"}],"options":{"stop":[]}}'
now stops properly, as does the same test with model phi3.
So many pull requests have been merged recently that it is hard to find the exact change that fixed it.
@reneleonhardt commented on GitHub (May 11, 2024):
I'm glad this has been finally fixed!
Yeah, too many merges causing problems mixed with some trying to fix them later if you're lucky 😅
I wonder why the test suite doesn't catch tokens inside instruction model responses for different prompt templates and endpoints...
@joshuavial commented on GitHub (May 13, 2024):
I had the same problem, but the issue was an out-of-date Ollama client; upgrading sorted things out.
@eevmanu commented on GitHub (May 13, 2024):
For any onlooker: if you're on Linux and updated Ollama recently as described here https://github.com/ollama/ollama/issues/3759#issuecomment-2104445225, don't forget to restart the service (sudo systemctl restart ollama.service). YMMV, but in my case it started throwing memory errors, despite the restart instructions in docs/linux.md (9c76b30d72, L51-L52).
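For the Linux case described above, the update-plus-restart amounts to something like the following; the install script URL is the standard one from ollama.com.

```bash
# Update Ollama via the install script, then restart the systemd service so the
# new binary is actually picked up.
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl restart ollama.service
```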
@adryan-ai commented on GitHub (May 15, 2024):
Upgrading Ollama resolved this for me, from 0.1.32 to 0.1.37.
@jmorganca commented on GitHub (Jun 25, 2024):
Great! Sorry for the issues and glad to see the newest versions fixed this for folks. Let me know if that's not the case and I can re-open