[GH-ISSUE #2774] What is the difference between /api/generate and /api/chat? #27432

GiteaMirror · 2026-04-22T04:46:46-05:00

GiteaMirror commented

2026-04-22 04:46:46 -05:00

Originally created by @owenzhao on GitHub (Feb 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2774

If I give both of them the same prompt and input, the answers will be the same, right? Then why are there two different APIs?

Or does chat keep context automatically? That is, when using /api/chat, does the answer automatically take the previous conversation into account, while /api/generate only answers the present prompt?


GiteaMirror closed this issue

2026-04-22 04:46:47 -05:00

GiteaMirror commented

2026-04-22 04:46:48 -05:00

@kescherCode commented on GitHub (Feb 27, 2024):

It's completions vs chat completions.


GiteaMirror commented

2026-04-22 04:46:50 -05:00

@owenzhao commented on GitHub (Feb 27, 2024):

> It's completions vs chat completions.

Thanks anyway. Your answer is like telling me that an apple is different from a pear. That is enough for people who already know apples and pears, but it gives me no clue, because I want to know the internal differences, not the difference in names.


GiteaMirror commented

2026-04-22 04:46:51 -05:00

@maximinus commented on GitHub (Mar 8, 2024):

Generate: post a single message and get a response.

Chat: post a single message and the previous chat history, and get a response.

Imagine this conversation:

> What's the capital of France?
> LLM: Paris

> And what about Germany?
> LLM: ???

If this was done via generate, the LLM would not have the context of the first exchange. With chat, it would also receive the previous history and could probably give the correct answer, "Berlin".
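To make the difference concrete, here is a minimal sketch of the two request bodies for the conversation above. It only builds the JSON and sends nothing. The model name `llama3` and the helper names `generate_body` and `chat_body` are assumptions for illustration; the `model`, `prompt`, `messages`, and `stream` fields are the ones the two endpoints accept.

```python
import json

def generate_body(model, prompt):
    """Body for POST /api/generate: a single prompt, no history."""
    return {"model": model, "prompt": prompt, "stream": False}

def chat_body(model, messages):
    """Body for POST /api/chat: the whole conversation so far."""
    return {"model": model, "messages": messages, "stream": False}

history = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris"},
    {"role": "user", "content": "And what about Germany?"},
]

# generate only carries the last question; chat carries the earlier turns too.
generate_request = generate_body("llama3", history[-1]["content"])
chat_request = chat_body("llama3", history)

print(json.dumps(generate_request, indent=2))
print(json.dumps(chat_request, indent=2))
```

With the generate body, the model only ever sees "And what about Germany?", so nothing in the request tells it that the topic is capitals.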


GiteaMirror commented

2026-04-22 04:46:52 -05:00

@owenzhao commented on GitHub (Mar 8, 2024):

> Generate: post a single message and get a response.
>
> Chat: post a single message and the previous chat history, and get a response.
>
> Imagine this conversation:
>
> > What's the capital of France?
> > LLM: Paris
> >
> > And what about Germany?
> > LLM: ???
>
> If this was done via generate, the LLM would not understand the context, however, with chat it would also have the previous history and could probably give the correct answer "Berlin".

Thank you for your clarifications. But in my own tests, the results were not as expected.

For example, when translating a word from one language to another, it is common for the word to have more than one meaning. Take the word "check": it can mean to examine, or a check from a checkbook.

So I think it is a good idea to let the LLM cross-translate, that is, to give it the word in the original language together with a word of the same meaning in another language, and then have it translate to the target language.

Something like:

original(en): check, (zh-Hans)检查, target language: foo
original(en): check, (zh-Hans)支票, target language: bar

However, I found that even with the generate API, the second result was affected by the first in many models. Instead of giving "bar", they gave "foo, bar" as the result.

So I wonder whether there is a way to get a clean result each time, without any previous context?


GiteaMirror commented

2026-04-22 04:46:53 -05:00

@maximinus commented on GitHub (Mar 8, 2024):

> Thank you for your clarifications. But in my own tests the results were not as expected.

I think we are all learning in this new area. But I can only clarify what the documentation says.

If you are getting different results you may need to use the same random seed, or lower the temperature of the model to 0, or something else; we don't know your setup and it would be hard to replicate anyway.
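As a rough sketch of that suggestion, the request bodies below pin `temperature` and `seed` in `options` and keep every translation in its own request. The prompts, the model name, the seed value, and the helper name `independent_requests` are hypothetical. No `context` value from an earlier /api/generate response is passed along, so on the API side nothing links the second request to the first.

```python
def independent_requests(model, prompts, seed=42):
    """Build one self-contained /api/generate body per prompt."""
    return [
        {
            "model": model,
            "prompt": prompt,
            "stream": False,
            # Same seed and zero temperature for every request.
            "options": {"temperature": 0, "seed": seed},
        }
        for prompt in prompts
    ]

# Hypothetical prompts modelled on the translation example in this thread.
prompts = [
    "original(en): check, (zh-Hans)检查, target language: fr",
    "original(en): check, (zh-Hans)支票, target language: fr",
]

requests_to_send = independent_requests("llama3", prompts)
for body in requests_to_send:
    print(body)
```

If the second answer still echoes the first with bodies like these, the cause is probably on the client side (for example, a wrapper that re-sends earlier output) rather than in the endpoint itself, but that is a guess without seeing the setup.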


GiteaMirror commented

2026-04-22 04:46:54 -05:00

@jmorganca commented on GitHub (Mar 12, 2024):

Hi there, thanks for creating an issue. As mentioned, the /api/chat endpoint takes a history of messages and provides the next message in the conversation. This is ideal for conversations with history. The /api/generate API provides a one-time completion based on the input.


GiteaMirror commented

2026-04-22 04:46:55 -05:00

@rutu-samas commented on GitHub (May 3, 2024):

@jmorganca, maybe you can help clarify this; it would settle the question for me and perhaps for others.

Is /api/chat equivalent to /api/generate if I pass chat_history as a string and append the user prompt to it, or does it do something more to keep context more efficiently?


GiteaMirror commented

2026-04-22 04:46:55 -05:00

@formigarafa commented on GitHub (May 12, 2024):

I feel I have the same question as you, @owenzhao, and I believe the answers above do not grasp the point of the question. So I will take a stab at it here; hopefully either I get it right, or someone who understands it better than I do corrects me and we get somewhere.

I am no expert on Go or whatever tool is used to make this project, but I've found some (possible) answers in the file https://github.com/ollama/ollama/blob/main/server/routes.go

On lines L972-L973, the API endpoints are defined with their respective handlers for generate and chat:

r.POST("/api/generate", s.GenerateHandler)
r.POST("/api/chat", s.ChatHandler)

Then GenerateHandler is defined from line 77 and on lines L273-L281 it calls runner.llama.Completion. The code looks like this:

req := llm.CompletionRequest{
  Prompt:  prompt,
  Format:  req.Format,
  Images:  images,
  Options: opts,
}
if err := runner.llama.Completion(c.Request.Context(), req, fn); err != nil {
  ch <- gin.H{"error": err.Error()}
}

ChatHandler is defined from line 1154. Now compare this snippet from lines L1295-L1302:

if err := runner.llama.Completion(c.Request.Context(), llm.CompletionRequest{
  Prompt:  prompt,
  Format:  req.Format,
  Images:  images,
  Options: opts,
}, fn); err != nil {
  ch <- gin.H{"error": err.Error()}
}

I could be wrong, but they do the same job when passing the prompt to the runner.llama.Completion call.
And from what I saw in the preceding lines, it seems the chat handler gets the list of messages from the API request params and builds a prompt from them.

I still want to test this theory, but from what I understood from the code, the advantage of api/chat seems to be that it prepares a prompt for you from a list of messages and answers in the same message format, so you can just use it in the next request. It would be enlightening to have this answer in the docs; it would have saved me a lot of time. Maybe I can contribute some edits to the docs later if I get to the bottom of all this. I am really enjoying Ollama; I've been learning heaps with it.

But in conclusion (if I am correct): if you format the prompt in exactly the same way as the chat API would do for you, then api/generate will produce the same result.


GiteaMirror commented

2026-04-22 04:46:56 -05:00

@SanchiMittal commented on GitHub (Jun 18, 2024):

Related question: in the case of /api/chat, when the prompt is created from the list of messages, is any form of summarization done, or are multiple calls made to the model, before the prompt is finally constructed? Or is everything simply concatenated and sent to the model to generate the response? I need this clarity to decide whether I should use /api/chat directly or /api/generate with my own customized prompt that includes a summarized chat history.

@jmorganca


GiteaMirror commented

2026-04-22 04:46:56 -05:00

@silasalves commented on GitHub (Jun 19, 2024):

I am also curious about that. I've made a quick test and the two functions seem to be very similar:

from ollama import Client
import json

conversation = [
    {
        'role': 'system',
        'content': 'You are a bored assistant. Provide short answers.',
    },
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
    {
        'role': 'assistant',
        'content': 'Because the gods wanted it that way.',
    },
    {
        'role': 'user',
        'content': 'Why did the gods want it that way?',
    }]

ollama = Client(host='http://localhost:11434')
response = ollama.chat(
    model='llama3', 
    messages=conversation,
    options={'temperature': 0})
print(response['message']['content'])

response = ollama.generate(
    model='llama3', 
    prompt=json.dumps(conversation),
    options={'temperature': 0})
print(response['response'])

Output:

*sigh* I don't know, okay? It's just science-y stuff...
{"role": "system", "content": "I'm not sure. Maybe they just felt like it."}

Notes:

- I set temperature = 0 so that the responses are always the same (no randomness), which allows a better comparison. You should be able to reproduce these results, or at the very least get the same pair of differing results every time.
- The two responses were different, although both of them admitted to "not knowing" and "dodged" the answer.
- The similarity between the answers corroborates @formigarafa's proposition that if you format the prompt in exactly the same way as the chat API would do for you, then api/generate will produce the same result. In this case, I simply failed to provide the exact same prompt.
- It seems that chat does some additional work, which could be (this is just me hallucinating, don't take this as factual information):
  - Formatting the messages: I lazily used json to transform the conversation into a string; maybe the chat function does more than that.
  - Letting the model know it is the "assistant", not the "system".
  - Unpacking the message if the model returns a JSON-formatted string.


GiteaMirror commented

2026-04-22 04:46:57 -05:00

@formigarafa commented on GitHub (Jun 19, 2024):

If you enable debug mode in the server you can see, besides a bunch of other information, the prompt being fed to the model. That helps to get to the bottom of it, but it is not clear to me whether any other different treatment is given to each prompt in their respective calls. I assume not. It is also hard to see what comes raw from the model when using the API, as debug mode does not include the responses.
I think it would be very educational to at least be able to log the prompt and the generation separately.


GiteaMirror commented

2026-04-22 04:46:58 -05:00

@silasalves commented on GitHub (Jun 19, 2024):

@formigarafa Thanks for pointing out the existence of debug mode (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md)! I enabled it, looked at the logs, and saw what is going on.

ollama.chat() transforms conversation to the following prompt:

<|start_header_id|>system<|end_header_id|>\n\nYou are a bored assistant. Provide short answers.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhy is the sky blue?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBecause the gods wanted it that way.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhy did the gods want it that way?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

Meanwhile, ollama.generate() transforms the dumped JSON string to the following prompt:

<|start_header_id|>user<|end_header_id|>\n\n[{\"role\": \"system\", \"content\": \"You are a bored assistant. Provide short answers.\"}, {\"role\": \"user\", \"content\": \"Why is the sky blue?\"}, {\"role\": \"assistant\", \"content\": \"Because the gods wanted it that way.\"}, {\"role\": \"user\", \"content\": \"Why did the gods want it that way?\"}]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

That means Chat and Generate use the model's template differently. This is the Llama3 template (https://ollama.com/library/llama3/blobs/8ab4849b038c):

{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>

While Chat uses <|start_header_id|>{{role_name}}<|end_header_id|>{{message}} for each message to create the conversation context, Generate only uses the prompt:

<|start_header_id|>user<|end_header_id|>\n\n{{prompt}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

I guess this solves the mystery! This was a good exercise to understand how the template is used as well. =)

Perhaps the README file should explain how the template is used by both the Chat and Generate functions. It's very basic, but it's so far under the hood that it's hard for beginners (like me) to understand.
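The two logged prompts can be reproduced with a few lines of Python. This is only a sketch that mirrors the strings seen in the debug log; it is not Ollama's actual template engine, and the function name `render_llama3` is made up for illustration.

```python
import json

def render_llama3(messages):
    """Wrap each message in Llama 3 header/eot markers, then open an assistant turn."""
    turns = [
        f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        for m in messages
    ]
    return "".join(turns) + "<|start_header_id|>assistant<|end_header_id|>\n\n"

conversation = [
    {"role": "system", "content": "You are a bored assistant. Provide short answers."},
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Because the gods wanted it that way."},
    {"role": "user", "content": "Why did the gods want it that way?"},
]

# What chat appeared to do: one templated turn per message.
chat_prompt = render_llama3(conversation)

# What generate did with json.dumps(conversation): the whole string became one user turn.
generate_prompt = render_llama3(
    [{"role": "user", "content": json.dumps(conversation)}]
)

print(chat_prompt)
print(generate_prompt)
```

In the second prompt the roles are just text inside a user message, which would explain why the model answered with a JSON-looking string.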


GiteaMirror commented

2026-04-22 04:46:58 -05:00

@formigarafa commented on GitHub (Jun 19, 2024):

@silasalves, please have another go, but set raw: true on generate. This way the template won't be applied to the prompt.
Also, try again using the prompt from the chat log as the input to generate with raw.
My hypothesis is that under these conditions the model should behave exactly the same.


GiteaMirror commented

2026-04-22 04:46:59 -05:00

@formigarafa commented on GitHub (Jun 19, 2024):

Here, I think I've now got a grip on how to run this test. Here are my results:

Chat call

thread = [
  {"role": "system", "content": "Your name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant."},
  {"role": "user", "content": "Hi! My name is Mario. What are the top 3 questions you are mostly asked around here?"},
]
CLIENT.chat({"model": MODEL_NAME, "messages": thread, "options": {"temperature": 0}})

Logged this prompt:

source=routes.go:1305 msg="chat handler" prompt="<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"

Produced this answer:

"Olá, Mario! Como assistente, eu não tenho uma lista específica de perguntas mais comuns que recebo. No entanto, posso ajudar com uma variedade de tópicos, desde respostas gerais até informações mais técnicas. Se você tiver alguma dúvida em particular ou precisar de ajuda com um tópico específico, sinta-se à vontade para perguntar!"

Generate call using raw=false (default)

chat_logged_prompt = "<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"
CLIENT.generate({"model": MODEL_NAME, "prompt": chat_logged_prompt, "raw": false, "options": {"temperature": 0}})

Logged these 2 prompts:

source=routes.go:179 msg="generate handler" prompt="<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"
source=routes.go:212 msg="generate handler" prompt="<|im_start|>user\n<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n<|im_end|>\n<|im_start|>assistant\n"

Produced this answer:

"Olá, Mario! As três perguntas mais comuns que eu recebo são:\n\n1. Como posso melhorar a eficiência do meu trabalho?\n2. Qual é o processo para resolver um problema técnico específico?\n3. Onde posso encontrar informações detalhadas sobre um tópico específico?"

Generate call using raw=true

chat_logged_prompt = "<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"
CLIENT.generate({"model": MODEL_NAME, "prompt": chat_logged_prompt, "raw": true, "options": {"temperature": 0}})

Logged this single prompt:

source=routes.go:212 msg="generate handler" prompt="<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"

Produced this answer (exactly the same as chat):

"Olá, Mario! Como assistente, eu não tenho uma lista específica de perguntas mais comuns que recebo. No entanto, posso ajudar com uma variedade de tópicos, desde respostas gerais até informações mais técnicas. Se você tiver alguma dúvida em particular ou precisar de ajuda com um tópico específico, sinta-se à vontade para perguntar!"

My experiments point towards => chat(thread) == generate(apply_template(thread), raw=true)

It is a bit hard to tell with certainty from this test alone whether it confirms the hypothesis, because there could be something else going on, and this specific case may have given a misleading result.
But from what I've been learning and all the other experiments I've made so far, this is the assumption I am proceeding with until I find something that contradicts it.
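For reference, the apply_template step can be sketched in Python for this ChatML-style model. The sketch builds the body of a raw generate request, matching the run whose answer was identical to chat, and sends nothing. The function names and the placeholder model name are hypothetical, and the rendering only mirrors the prompt seen in the log above, not Ollama's own template code.

```python
def apply_template(messages):
    """Render messages the way the logged ChatML-style prompt looks."""
    rendered = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # Leave an open assistant turn for the model to complete.
    return rendered + "<|im_start|>assistant\n"

def raw_generate_body(model, messages):
    """Body for /api/generate with raw set, so no template is applied again."""
    return {
        "model": model,
        "prompt": apply_template(messages),
        "raw": True,
        "stream": False,
        "options": {"temperature": 0},
    }

thread = [
    {"role": "system", "content": "Your name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant."},
    {"role": "user", "content": "Hi! My name is Mario. What are the top 3 questions you are mostly asked around here?"},
]

# "my-chatml-model" is a placeholder; the thread does not name the model.
body = raw_generate_body("my-chatml-model", thread)
print(body["prompt"])
```

Sending the same string without raw wraps it in the template a second time, which is what the 2-prompt log of the raw=false run shows.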

edit:

Generate call using raw=false and straight system and prompt params:

system = "Your name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant."
prompt = "Hi! My name is Mario. What are the top 3 questions you are mostly asked around here?"
CLIENT.generate({"model": MODEL_NAME, "system": system, "prompt": prompt, "raw": false, "options": {"temperature": 0}})

Logged these 3 prompts:

source=routes.go:179 msg="generate handler" prompt="Hi! My name is Mario. What are the top 3 questions you are mostly asked around here?"
source=routes.go:181 msg="generate handler" system="Your name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant."
source=routes.go:212 msg="generate handler" prompt="<|im_start|>system\nYour name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi! My name is Mario. What are the top 3 questions you are mostly asked around here?<|im_end|>\n<|im_start|>assistant\n"

Produced this answer:

"Olá, Mario! Como assistente, eu não tenho uma lista específica de perguntas mais comuns que recebo. No entanto, posso ajudar com uma variedade de tópicos, desde respostas gerais até informações mais técnicas. Se você tiver alguma dúvida em particular ou precisar de ajuda com um tópico específico, sinta-se à vontade para perguntar!"

This one also produced the same answer as chat, but it is limited to a single question from the user. It would not work the same with a follow-up question, for example, because the system and prompt params give no way to format the earlier turns the way chat does. My assumptions, so far, remain unchanged.
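For the follow-up case, the chat-side bookkeeping is roughly the sketch below: append the assistant's reply and the new user question to the message list, and send the whole list again. The helper name `next_turn`, the shortened reply, and the follow-up question are made up for illustration.

```python
def next_turn(thread, assistant_reply, follow_up):
    """Return a new message list with the reply and the follow-up appended."""
    return thread + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": follow_up},
    ]

thread = [
    {"role": "system", "content": "Your name is Laura. You answer every question exclusively in Portuguese. You are a helpful assistant."},
    {"role": "user", "content": "Hi! My name is Mario. What are the top 3 questions you are mostly asked around here?"},
]

# Hypothetical reply and follow-up; a real reply would come from the previous chat response.
second_thread = next_turn(thread, "Olá, Mario!", "What is my name?")

for message in second_thread:
    print(message["role"], ":", message["content"])
```

The list `second_thread` is what would go in the `messages` field of the next /api/chat request; the original `thread` is left untouched.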


GiteaMirror commented

2026-04-22 04:47:00 -05:00

@malteneuss commented on GitHub (Jun 23, 2024):

I found a helpful YouTube video by Matt Williams that discusses the difference: https://www.youtube.com/watch?v=kaK3ye8rczA It basically comes down to convenience. For one-off questions, you would use the `/api/generate` endpoint for quick results. For back-and-forth (like in a real conversation with a chatbot), you would use the `/api/chat` endpoint. You can also use the chat endpoint for one-off questions and fake previous responses in the "assistant" role, as in https://github.com/technovangelist/videoprojects/blob/e0eee85d67b3cf8d885d472980b03b3e819ef8c3/2024-02-15-functioncalling/fc.py#L30, to inject some examples. Apparently some models then better understand what they should produce for their actual response.
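As a rough illustration of the "fake previous responses" idea, the sketch below builds a `messages` list for `/api/chat` where made-up user/assistant pairs come before the real question. The helper name `build_few_shot_messages` and the example texts are assumptions for illustration only; they are not taken from the linked script.

```python
# Hypothetical helper: builds a chat thread in which invented
# user/assistant pairs act as examples before the real question.
def build_few_shot_messages(system, examples, question):
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        # The assistant never actually said this; it is injected as an example.
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages


messages = build_few_shot_messages(
    system="Answer with a single word.",
    examples=[
        ("What color is the sky?", "Blue"),
        ("What color is grass?", "Green"),
    ],
    question="What color is snow?",
)

# Payload for /api/chat; "MODEL_NAME" is a placeholder.
chat_payload = {"model": "MODEL_NAME", "messages": messages}
```

The same examples could be pasted into a single `/api/generate` prompt, but the chat form keeps each example in its own role, which is the convenience described above.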


GiteaMirror commented

2026-04-22 04:47:01 -05:00

@Propfend commented on GitHub (Sep 9, 2024):

Thanks for the responses!!



Reference: github-starred/ollama#27432