[GH-ISSUE #2774] What is the difference between /api/generate and /api/chat? #48184

Closed
opened 2026-04-28 07:04:35 -05:00 by GiteaMirror · 16 comments

Originally created by @owenzhao on GitHub (Feb 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2774

I mean, if I give them the same prompt and input, the answers should be the same, right? Then why are there two different APIs?

Or does chat automatically keep context? I mean, when using /api/chat, will the answer automatically take the previous conversation into account, while /api/generate only answers the present prompt?


@kescherCode commented on GitHub (Feb 27, 2024):

It's completions vs chat completions.


@owenzhao commented on GitHub (Feb 27, 2024):

> It's completions vs chat completions.

Thanks anyway, but your answer is like telling me that an apple is different from a pear. That is enough for people who already know apples and pears, but it doesn't help me, because I want to know the internal differences, not just the different names.


@maximinus commented on GitHub (Mar 8, 2024):

Generate: post a single message and get a response.

Chat: post a single message and the previous chat history, and get a response.

Imagine this conversation:

> What's the capital of France?
> LLM: Paris

> And what about Germany?
> LLM: ???

If this were done via generate, the LLM would not understand the context; with chat, however, it would also have the previous history and could probably give the correct answer, "Berlin".
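
In code, that difference looks roughly like this (a sketch using the ollama Python client; "llama3" is just an example model name):

from ollama import Client

client = Client(host='http://localhost:11434')

# generate: each call stands alone, so the follow-up question has no context
client.generate(model='llama3', prompt="What's the capital of France?")
followup = client.generate(model='llama3', prompt='And what about Germany?')
print(followup['response'])  # the model has to guess what "what about" refers to

# chat: the history is sent along, so the follow-up question has context
reply = client.chat(model='llama3', messages=[
    {'role': 'user', 'content': "What's the capital of France?"},
    {'role': 'assistant', 'content': 'Paris'},
    {'role': 'user', 'content': 'And what about Germany?'},
])
print(reply['message']['content'])  # now it has the context to answer "Berlin"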


@owenzhao commented on GitHub (Mar 8, 2024):

> Generate: post a single message and get a response.
>
> Chat: post a single message and the previous chat history, and get a response.
>
> Imagine this conversation:
>
> > What's the capital of France?
> > LLM: Paris
> >
> > And what about Germany?
> > LLM: ???
>
> If this were done via generate, the LLM would not understand the context; with chat, however, it would also have the previous history and could probably give the correct answer, "Berlin".

Thank you for your clarifications. But in my own tests the results were not as expected.

For example, when translating a word from one language to another, it is common for a word to have more than one meaning. Take the word "check": it can mean to examine, or a check from a checkbook.

So I think it is a good idea to let the LLM cross-translate, that is, to give it the word in the original language together with a word that has the same meaning in another language, and then translate to the target language.

something like:

  1. original(en): check, (zh-Hans)检查, target language: foo
  2. original(en): check, (zh-Hans)支票, target language: bar

However, I found that even with the generate API, the second result was affected by the first in many models. Instead of giving "bar", they gave "foo, bar" as the result.

So I wonder whether there is a way to get a clean result each time, without any previous context.


@maximinus commented on GitHub (Mar 8, 2024):

> Thank you for your clarifications. But in my own tests the results were not as expected.

I think we are all learning in this new area. But I can only clarify what the documentation says.

If you are getting different results you may need to use the same random seed, or lower the temperature of the model to 0, or something else; we don't know your setup and it would be hard to replicate anyway.
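
For example, with the ollama Python client you could pin both of those options (a sketch; temperature and seed are standard Ollama options, though how deterministic the output ends up being still depends on the backend):

from ollama import Client

client = Client(host='http://localhost:11434')

# Pin the sampling options so repeated runs of the same prompt are comparable.
reply = client.generate(
    model='llama3',
    prompt='Translate the word "check" (as in bank check) to German. Answer with one word.',
    options={'temperature': 0, 'seed': 42},
)
print(reply['response'])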


@jmorganca commented on GitHub (Mar 12, 2024):

Hi there, thanks for creating an issue. As mentioned the /api/chat endpoint takes a history of messages and provides the next message in the conversation. This is ideal for conversations with history. The /api/generate API provides a one-time completion based on the input.
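
Concretely, the two request shapes look roughly like this (a sketch against the HTTP API with streaming disabled; see the API docs for the full set of fields):

import requests

BASE = 'http://localhost:11434'

# /api/generate: a one-time completion from a single prompt string.
r = requests.post(f'{BASE}/api/generate', json={
    'model': 'llama3',
    'prompt': 'Why is the sky blue?',
    'stream': False,
})
print(r.json()['response'])

# /api/chat: the whole message history goes in, the next assistant message comes out.
r = requests.post(f'{BASE}/api/chat', json={
    'model': 'llama3',
    'messages': [
        {'role': 'user', 'content': 'Why is the sky blue?'},
        {'role': 'assistant', 'content': 'Rayleigh scattering.'},
        {'role': 'user', 'content': 'Explain that in one sentence.'},
    ],
    'stream': False,
})
print(r.json()['message']['content'])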


@rutu-samas commented on GitHub (May 3, 2024):

@jmorganca, maybe you can help clarify this; that would clear up the question for me and perhaps others.

Is /api/chat equivalent to /api/generate if I give generate the chat history as a string with the user prompt appended, or does chat do something more to keep context more efficiently?


@formigarafa commented on GitHub (May 12, 2024):

I feel I have the same question as you, @owenzhao, and I believe the answers above do not get at the heart of it. So I will take a jab at it here; hopefully I either get it right, or someone who understands it better than me corrects me, and we get somewhere.

I am no expert on Go or whatever tooling is used to build this project, but I've found some (possible) answers in the file https://github.com/ollama/ollama/blob/main/server/routes.go

On lines L972-L973 the API endpoints are defined with their respective handlers for generate and chat:

r.POST("/api/generate", s.GenerateHandler)
r.POST("/api/chat", s.ChatHandler)

Then GenerateHandler is defined starting at line 77, and on lines L273-L281 it calls runner.llama.Completion. The code looks like this:

req := llm.CompletionRequest{
  Prompt:  prompt,
  Format:  req.Format,
  Images:  images,
  Options: opts,
}
if err := runner.llama.Completion(c.Request.Context(), req, fn); err != nil {
  ch <- gin.H{"error": err.Error()}
}

ChatHandler is defined starting at line 1154; now compare this snippet from lines L1295-L1302:

if err := runner.llama.Completion(c.Request.Context(), llm.CompletionRequest{
  Prompt:  prompt,
  Format:  req.Format,
  Images:  images,
  Options: opts,
}, fn); err != nil {
  ch <- gin.H{"error": err.Error()}
}

I could be wrong, but they do the same job when passing the prompt to the runner.llama.Completion call.
And from what I saw in the preceding lines, it seems the chat handler gets the list of messages from the API request params and builds a prompt.

I still want to test this theory, but from what I understood of the code, the advantage of api/chat is that it prepares a prompt for you from a list of messages and answers in the same message format, so you can just use the answer in the next request. It would have been enlightening to just get this answer from the docs; it would have saved me a lot of time. Maybe I can contribute some edits to the docs later once I get to the bottom of all this. I am really enjoying Ollama; I've been learning heaps with it.

But in conclusion (if I am correct), if you format the prompt in the exact same way as the chat API would do for you, then api/generate will produce the same result.
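
For what it's worth, the "use the answer on the next request" loop looks roughly like this with the Python client (a sketch; the model name is just an example):

from ollama import Client

client = Client(host='http://localhost:11434')
messages = [{'role': 'user', 'content': "What's the capital of France?"}]

reply = client.chat(model='llama3', messages=messages)
# The answer comes back in the same message format as the input...
messages.append(reply['message'])
# ...so you just append the next user turn and call chat again.
messages.append({'role': 'user', 'content': 'And what about Germany?'})
reply = client.chat(model='llama3', messages=messages)
print(reply['message']['content'])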


@SanchiMittal commented on GitHub (Jun 18, 2024):

Related question: in the case of /api/chat, when the prompt is created from the list of messages, is any form of summarization done, or are multiple calls made to the model, before the prompt is finally constructed? Or is everything just concatenated directly and sent to the model to generate the response? I need this clarity to better decide whether I should use /api/chat directly or /api/generate with my own customized prompt that includes a summarized chat history.

@jmorganca


@silasalves commented on GitHub (Jun 19, 2024):

I am also curious about that. I've made a quick test and the two functions seem to be very similar:

from ollama import Client
import json

conversation = [
    {
        'role': 'system',
        'content': 'You are a bored assistant. Provide short answers.',
    },
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
    {
        'role': 'assistant',
        'content': 'Because the gods wanted it that way.',
    },
    {
        'role': 'user',
        'content': 'Why did the gods want it that way?',
    }]

ollama = Client(host='http://localhost:11434')
response = ollama.chat(
    model='llama3', 
    messages=conversation,
    options={'temperature': 0})
print(response['message']['content'])

response = ollama.generate(
    model='llama3', 
    prompt=json.dumps(conversation),
    options={'temperature': 0})
print(response['response'])

Output:

*sigh* I don't know, okay? It's just science-y stuff...
{"role": "system", "content": "I'm not sure. Maybe they just felt like it."}

Notes:

  • I set temperature = 0 so that the responses are always the same (no randomness), to allow a better comparison. You should be able to reproduce these results, or at the very least get the same (different) results every time.
  • The two responses were different, although both of them admitted to not knowing and dodged the answer.
  • The similarity between the answers corroborates @formigarafa's proposition that if you format the prompt in the exact same way as the chat API would do for you, then api/generate will produce the same result. In that case, I simply failed to provide the exact same prompt.
  • It seems that chat does some additional work, which could be (this is just me hallucinating, don't take this as factual information):
    • Formatting the messages: I lazily used json to transform the conversation to a string, maybe the chat function does more than that.
    • Letting the model know it is the "assistant", not the "system"
    • Unpacking the message if the model returns a JSON formatted string.

@formigarafa commented on GitHub (Jun 19, 2024):

If you enable debug mode in the server you can see, besides a bunch of other information, the prompt being fed to the model. That helps to get to the bottom of it, but it is not clear to me whether any other different treatment is given to each prompt in their respective calls. I assume it is not. But it is also hard to see what is coming raw from the model when using the API, as the debug mode does not log the responses.
I think it would be very educational, at the least, to be able to log the prompt and the generation in isolation.


@silasalves commented on GitHub (Jun 19, 2024):

@formigarafa Thanks for pointing out the existence of debug mode (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md)! I did that, looked at the logs, and saw what is going on.

ollama.chat() transforms conversation to the following prompt:

<|start_header_id|>system<|end_header_id|>\n\nYou are a bored assistant. Provide short answers.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhy is the sky blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBecause the gods wanted it that way.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhy did the gods want it that way?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

Meanwhile, ollama.generate() transforms the dumped JSON string into the following prompt:

<|start_header_id|>user<|end_header_id|>\n\n[{\"role\": \"system\", \"content\": \"You are a bored assistant. Provide short answers.\"}, {\"role\": \"user\", \"content\": \"Why is the sky blue?\"}, {\"role\": \"assistant\", \"content\": \"Because the gods wanted it that way.\"}, {\"role\": \"user\", \"content\": \"Why did the gods want it that way?\"}]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

That means Chat and Generate use the model's template differently. This is the Llama3 template (https://ollama.com/library/llama3/blobs/8ab4849b038c):

{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>

While Chat uses <|start_header_id|>{{role_name}}<|end_header_id|>{{message}} for each message to create the conversation context, Generate only uses the prompt:

<|start_header_id|>user<|end_header_id|>\n\n{{prompt}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

I guess this solves the mystery! This was a good exercise to understand how the template is used as well. =)
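
In other words, roughly (this is only a sketch of the observed behaviour, not Ollama's actual Go code):

def render_chat(messages):
    # /api/chat: each message is wrapped in its own role header.
    out = ''
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    return out + '<|start_header_id|>assistant<|end_header_id|>\n\n'

def render_generate(prompt, system=None):
    # /api/generate: whatever string you pass ends up in the single user slot.
    out = ''
    if system:
        out += f'<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>'
    out += f'<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>'
    return out + '<|start_header_id|>assistant<|end_header_id|>\n\n'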

Perhaps the README file should include this information on how the template is used for both the Chat and Generate functions. It's very basic, but it's so far under the hood that it's hard for beginners (like me) to understand it.


@formigarafa commented on GitHub (Jun 19, 2024):

@silasalves, please have another go, but set the option raw: true on generate. This way the model won't use the template.
Also, try again using the output from the chat log as the input to generate with raw.
My hypothesis is that under these conditions the model should behave exactly the same.


@formigarafa commented on GitHub (Jun 19, 2024):

OK, I think I now have a grip on how to run this test. Here are my results:

Chat call

thread = [
  {"role": "system", "content": "Your name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant."},
  {"role": "user", "content": "Hi! My name is Mario. What are the top 3 questions you are mostly asked around here?"},
]
client.chat({"model": MODEL_NAME, "messages": thread, "options": {"temperature": 0})

Logged this prompt:

source=routes.go:1305 msg="chat handler" prompt="<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"

Produced this answer:

"Olá, Mario! Como assistente, eu não tenho uma lista específica de perguntas mais comuns que recebo. No entanto, posso ajudar com uma variedade de tópicos, desde respostas gerais até informações mais técnicas. Se você tiver alguma dúvida em particular ou precisar de ajuda com um tópico específico, sinta-se à vontade para perguntar!"

Generate call using raw=false (default)

chat_logged_prompt = "<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"
CLIENT.generate({"model": MODEL_NAME, "prompt": chat_logged_prompt, "raw": false, "options": {"temperature": 0}})

Logged these 2 prompts:

source=routes.go:179 msg="generate handler" prompt="<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"
source=routes.go:212 msg="generate handler" prompt="<|im_start|>user\n<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n<|im_end|>\n<|im_start|>assistant\n"

Produced this answer:

"Olá, Mario! As três perguntas mais comuns que eu recebo são:\n\n1. Como posso melhorar a eficiência do meu trabalho?\n2. Qual é o processo para resolver um problema técnico específico?\n3. Onde posso encontrar informações detalhadas sobre um tópico específico?"

Generate call using raw=true

chat_logged_prompt = "<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"
CLIENT.generate({"model": MODEL_NAME, "prompt": chat_logged_prompt, "raw": true, "options": {"temperature": 0}})

Logged this single prompt:

source=routes.go:212 msg="generate handler" prompt="<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"

Produced this answer (exactly the same as chat):

"Olá, Mario! Como assistente, eu não tenho uma lista específica de perguntas mais comuns que recebo. No entanto, posso ajudar com uma variedade de tópicos, desde respostas gerais até informações mais técnicas. Se você tiver alguma dúvida em particular ou precisar de ajuda com um tópico específico, sinta-se à vontade para perguntar!"

My experiments point towards => chat(thread) == generate(apply_template(thread), raw=true)

It is a bit hard to tell with certainty from this test alone whether it confirms the hypothesis, because there could be something else going on, and maybe in this specific case we got a false representation of the result.
But so far, from what I've been learning and all the other experiments I've made, this is the assumption I am working under until I find something to contradict it.


edit:

Generate call using raw=false and straight system and prompt params:

system = "Your name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant."
prompt = "Hi! My name is Mario. What are the top 3 questions you are mostly asked around here?"
CLIENT.generate({"model": MODEL_NAME, "system": system, "prompt": prompt, "raw": false, "options": {"temperature": 0}})

Logged these 3 prompts:

source=routes.go:179 msg="generate handler" prompt="Hi! My name is Mario. What are the top 3 questions you are mostly asked around here?"
source=routes.go:181 msg="generate handler" system="Your name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant."
source=routes.go:212 msg="generate handler" prompt="<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"

Produced this answer:

"Olá, Mario! Como assistente, eu não tenho uma lista específica de perguntas mais comuns que recebo. No entanto, posso ajudar com uma variedade de tópicos, desde respostas gerais até informações mais técnicas. Se você tiver alguma dúvida em particular ou precisar de ajuda com um tópico específico, sinta-se à vontade para perguntar!"

This one also worked the same as chat, but it would be limited to a single question from the user. It would not work the same with a follow-up question, for example, because there is no way to format that into a single prompt the same way chat does. My assumptions, so far, remain unchanged.


@malteneuss commented on GitHub (Jun 23, 2024):

I found a helpful YouTube video by Matt Williams that discusses the difference: https://www.youtube.com/watch?v=kaK3ye8rczA It basically comes down to convenience. For one-off questions you would use the /api/generate endpoint for quick results. For back-and-forth (like in a real conversation with a chatbot), you would use the /api/chat endpoint. That said, you can also use the chat endpoint for one-off questions and fake previous responses in the "assistant" role, as in https://github.com/technovangelist/videoprojects/blob/e0eee85d67b3cf8d885d472980b03b3e819ef8c3/2024-02-15-functioncalling/fc.py#L30, to inject some examples. Apparently some models then better understand what they should produce for their actual response.
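
A rough sketch of that trick with the Python client (the model name and the mini-examples are made up):

from ollama import Client

client = Client(host='http://localhost:11434')

# One-off question, but with invented earlier turns acting as few-shot examples
# that show the model the expected output format.
reply = client.chat(model='llama3', messages=[
    {'role': 'system', 'content': 'Translate French to English. Reply with a single word.'},
    {'role': 'user', 'content': 'chat'},
    {'role': 'assistant', 'content': 'cat'},
    {'role': 'user', 'content': 'chien'},
    {'role': 'assistant', 'content': 'dog'},
    {'role': 'user', 'content': 'cheval'},  # the actual question
])
print(reply['message']['content'])  # ideally just "horse"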


@Propfend commented on GitHub (Sep 9, 2024):

Thanks for the responses!!

Reference: github-starred/ollama#48184