[GH-ISSUE #13189] Same prompt, inconsistent results based on ollama inference and direct inference #8719

Closed
opened 2026-04-12 21:29:22 -05:00 by GiteaMirror · 7 comments

Originally created by @lemonblock98 on GitHub (Nov 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13189

What is the issue?

I fine-tuned a Qwen3-0.6B model using the Hugging Face Transformers Trainer.

  • For model conversion, I used the convert_hf_to_gguf.py script from llama.cpp to convert my fine-tuned SafeTensors model into .gguf format.
  • For the Modelfile, I directly reused the original Modelfile for Qwen3.

Now, when I run inference with the same list of messages using the following two methods, I get inconsistent results:

  1. Using Ollama
url = "http://localhost:11434/api/chat"
data = {
    "model": "qwen3-0.6b-ft",
    "messages": messages,
    "tools": tools,
    "top_p": 0.8,
    "temperature": 0.2,
    "top_k": 20,
    "max_tokens": 2048,
    "stream": False,
    "think": True
}
response = requests.post(url, json=data).json()["message"]
  2. Using Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("...", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("...")

# Render the prompt string with the Qwen3 chat template
instruction = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)

model_inputs = tokenizer(instruction, add_special_tokens=False, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048,
    top_p=0.8,
    temperature=0.2,
    top_k=20
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# 151668 is the </think> token id in the Qwen3 tokenizer; find the last one to split thinking from the answer
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

I’ve already confirmed that the TEMPLATE in the Modelfile matches the original Qwen3 chat template exactly. What could be causing this discrepancy?
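One thing worth double-checking in the Ollama request above: Ollama's /api/chat normally reads sampler settings from a nested "options" object, and its name for the generation limit is num_predict, so the top-level top_p / temperature / top_k / max_tokens fields may be ignored in favor of the Modelfile defaults. A minimal sketch of the nested form (same model, messages, and tools assumed):

import requests

url = "http://localhost:11434/api/chat"
data = {
    "model": "qwen3-0.6b-ft",
    "messages": messages,
    "tools": tools,
    "stream": False,
    "think": True,
    # Sampler settings go under "options"; num_predict is Ollama's max-token setting.
    "options": {"top_p": 0.8, "temperature": 0.2, "top_k": 20, "num_predict": 2048},
}
response = requests.post(url, json=data).json()["message"]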

Relevant log output


OS

macOS

GPU

No response

CPU

Apple

Ollama version

0.13.0

GiteaMirror added the bug label 2026-04-12 21:29:22 -05:00

@lemonblock98 commented on GitHub (Nov 21, 2025):

The Modelfile for Qwen3:

FROM ./qwen3-0.6B-ft.gguf
TEMPLATE """
{{- $lastUserIdx := -1 -}}
{{- range $idx, $msg := .Messages -}}
{{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}}
{{- end }}
{{- if or .System .Tools }}<|im_start|>system
{{ if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end -}}
<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}
{{- if and $.IsThinkSet (eq $i $lastUserIdx) }}
   {{- if $.Think -}}
      {{- " "}}/think
   {{- else -}}
      {{- " "}}/no_think
   {{- end -}}
{{- end }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) -}}
<think>{{ .Thinking }}</think>
{{ end -}}
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ if and $.IsThinkSet (not $.Think) -}}
<think>

</think>

{{ end -}}
{{ end }}
{{- end }}"""
PARAMETER top_k 20
PARAMETER top_p 0.80
PARAMETER repeat_penalty 1.0
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER temperature 0.2
<!-- gh-comment-id:3562132384 --> @lemonblock98 commented on GitHub (Nov 21, 2025): The Modelfile for Qwen3: ``` From ./qwen3-0.6B-ft.gguf TEMPLATE """ {{- $lastUserIdx := -1 -}} {{- range $idx, $msg := .Messages -}} {{- if eq $msg.Role "user" }}{{ $lastUserIdx = $idx }}{{ end -}} {{- end }} {{- if or .System .Tools }}<|im_start|>system {{ if .System }} {{ .System }} {{- end }} {{- if .Tools }} # Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> {{- range .Tools }} {"type": "function", "function": {{ .Function }}} {{- end }} </tools> For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call> {"name": <function-name>, "arguments": <args-json-object>} </tool_call> {{- end -}} <|im_end|> {{ end }} {{- range $i, $_ := .Messages }} {{- $last := eq (len (slice $.Messages $i)) 1 -}} {{- if eq .Role "user" }}<|im_start|>user {{ .Content }} {{- if and $.IsThinkSet (eq $i $lastUserIdx) }} {{- if $.Think -}} {{- " "}}/think {{- else -}} {{- " "}}/no_think {{- end -}} {{- end }}<|im_end|> {{ else if eq .Role "assistant" }}<|im_start|>assistant {{ if (and $.IsThinkSet (and .Thinking (or $last (gt $i $lastUserIdx)))) -}} <think>{{ .Thinking }}</think> {{ end -}} {{ if .Content }}{{ .Content }} {{- else if .ToolCalls }}<tool_call> {{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}} {{ end }}</tool_call> {{- end }}{{ if not $last }}<|im_end|> {{ end }} {{- else if eq .Role "tool" }}<|im_start|>user <tool_response> {{ .Content }} </tool_response><|im_end|> {{ end }} {{- if and (ne .Role "assistant") $last }}<|im_start|>assistant {{ if and $.IsThinkSet (not $.Think) -}} <think> </think> {{ end -}} {{ end }} {{- end }}""" PARAMETER top_k 20 PARAMETER top_p 0.80 PARAMETER repeat_penalty 1.0 PARAMETER stop <|im_start|> PARAMETER stop <|im_end|> PARAMETER temperature 0.2 ```

@rick-github commented on GitHub (Nov 21, 2025):

What discrepancy?

<!-- gh-comment-id:3562331706 --> @rick-github commented on GitHub (Nov 21, 2025): What discrepancy?

@rick-github commented on GitHub (Nov 21, 2025):

This will go much faster if you use words to explain what the perceived issue is.

<!-- gh-comment-id:3562709240 --> @rick-github commented on GitHub (Nov 21, 2025): This will go much faster if you use words to explain what the perceived issue is.

@lemonblock98 commented on GitHub (Nov 21, 2025):

The core issue is that, when using the SAME messages and generation parameters:

  1. Inference with Transformers produces results that align with the expected output
    (the model has undergone SFT training)
  2. Inference with Ollama produces results that severely deviate from the training data.

Theoretically, as long as the same TEMPLATE is used, shouldn't the inference results from both
methods be consistent?

The inference code and generation parameters have been provided in the first message.

<!-- gh-comment-id:3563428047 --> @lemonblock98 commented on GitHub (Nov 21, 2025): The core issue is that when using SAME messages and generation parameters: 1. Inference with Transformers produces results that align with the expected output (the model has undergone SFT training) 2. Inference with Ollama produces results that severely deviate from the training data. Theoretically, as long as the same TEMPLATE is used, the inference results from both methods should be consistent? The inference code and generation parameters have been provided in the first message.

@rick-github commented on GitHub (Nov 21, 2025):

Token generation is auto-complete driven by a pseudo-random number generator, so results will diverge. If you control for seed and temperature, then you will have an apples-to-apples comparison.

You haven't supplied a copy of the fine-tuned model, the prompt, or the context, so the code and parameters are not useful.

Do you get the same divergence if you use the base model?

<!-- gh-comment-id:3563460517 --> @rick-github commented on GitHub (Nov 21, 2025): Token generation is auto-complete driven by a pseudo-random number generator so results will diverge. If you control for seed and temperature then you will have an apples/apples comparison. You haven't supplied a copy of the finetuned model, the prompt or the context so the code and parameters are not useful. Do you get the same divergence if you use the base model?

@lemonblock98 commented on GitHub (Nov 21, 2025):

Since the fine-tuned model is privately deployed, I can't upload it or the full context, but here is the code I used to construct training data from raw messages, in the hope that it helps with troubleshooting:

messages = [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
tools = [{"type": "function", "function": {...}}, ..]

total_content = qwen3_tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, enable_thinking=True)
# Split at the first <think> tag: everything before it is the instruction, the rest is the training target.
split_idx = total_content.find("<think>")
instruction = total_content[:split_idx]
output = total_content[split_idx:]

I tested direct inference with the base model, but since the base model hasn't been trained on my data, its outputs are inherently unstable, so the two calling methods cannot be aligned in the first place.
However, with the fine-tuned model, although the reasoning content and responses have some randomness, the final tool selection is usually stable (say it consistently picks Tool A). But when using Ollama for inference, the model selects Tool B instead.

I want to confirm:

  1. If I specify Qwen3's official TEMPLATE in the Modelfile, can the prompt generated from raw messages be functionally equivalent to the prompt produced by qwen3_tokenizer.apply_chat_template?

  2. Or is there a way to print out the prompt after TEMPLATE concatenation each time the model is called?

<!-- gh-comment-id:3563730532 --> @lemonblock98 commented on GitHub (Nov 21, 2025): Since the fine-tuned model is privately deployed and cannot be uploaded with context, here is the code I used to construct training data from raw messages, hoping it helps with troubleshooting: ```python messages = [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}] tools = [{"type": "function", "function": {...}}, ..] total_content = qwen3_tokenizer.apply_chat_template(msgs, tools=fncall_tools, tokenize=False, enable_thinking=True) split_idx = total_content.find("<think>") instruction = total_content[:split_idx] output = total_content[split_idx:] ``` I tested direct inference using base models, but since base models are untrained, their outputs are inherently unstable, so the two calling methods cannot align in the first place. However, with the fine-tuned model, although the reasoning content and responses have some randomness, the final tool selection is usually stable (assuming it's Tool A). But when using Ollama for inference, the model selects Tool B instead. I want to confirm: 1. If I specify Qwen3's official TEMPLATE in the Modelfile, can the prompt generated from raw messages be functionally equivalent to the prompt produced by `qwen3_tokenizer.apply_chat_template`? 2. Or is there a way to print out the prompt after TEMPLATE concatenation each time the model is called?

@rick-github commented on GitHub (Nov 21, 2025):

  1. If I specify Qwen3's official TEMPLATE in the Modelfile, can the prompt generated from raw messages be functionally equivalent to the prompt produced by qwen3_tokenizer.apply_chat_template?

They should be functionally the same. There are differences in formatting and think handling. For example, a prompt of "hello" and passing a tool to the Jinja template generates the following:

<|im_start|>system
# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"function": {"description": "Raises x to the y power and returns the result", "name": "power", "parameters": {"properties": {"x": {"description": "", "type": "number"}, "y": {"description": "", "type": "number"}}, "required": ["x", "y"], "type": "object"}}, "type": "function"}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
hello<|im_end|>
<|im_start|>assistant

The same prompt and tool passed through ollama generates the following:

<|im_start|>system


# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name":"power","description":"Raises x to the y power and returns the result","parameters":{"type":"object","required":["x","y"],"properties":{"x":{"type":"number"},"y":{"type":"number"}}}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
hello /think<|im_end|>
<|im_start|>assistant

These are the differences:

@@ -1,11 +1,13 @@
 <|im_start|>system
+
+
 # Tools
 
 You may call one or more functions to assist with the user query.
 
 You are provided with function signatures within <tools></tools> XML tags:
 <tools>
-{"function": {"description": "Raises x to the y power and returns the result", "name": "power", "parameters": {"properties": {"x": {"description": "", "type": "number"}, "y": {"description": "", "type": "number"}}, "required": ["x", "y"], "type": "object"}}, "type": "function"}
+{"type": "function", "function": {"name":"power","description":"Raises x to the y power and returns the result","parameters":{"type":"object","required":["x","y"],"properties":{"x":{"type":"number"},"y":{"type":"number"}}}}}
 </tools>
 
 For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
@@ -13,5 +15,5 @@
 {"name": <function-name>, "arguments": <args-json-object>}
 </tool_call><|im_end|>
 <|im_start|>user
-hello<|im_end|>
+hello /think<|im_end|>
 <|im_start|>assistant

The difference in the tool definition (other than ordering) is the argument description. Depending on the definition of your tools, this may be different between the two invocations.

@@ -5,11 +5,9 @@
     "parameters": {
       "properties": {
         "x": {
-          "description": "",
           "type": "number"
         },
         "y": {
-          "description": "",
           "type": "number"
         }
       },

  2. Or is there a way to print out the prompt after TEMPLATE concatenation each time the model is called?

Add OLLAMA_DEBUG=2 to the server environment and look for "completion request" prompt in the log.
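
To make that comparison concrete, a sketch (assuming the logged prompt has been copied into a local file, here called ollama_prompt.txt, and that messages and tools are the same objects sent in both requests) that renders the Transformers prompt and diffs it against the one Ollama logged:

import difflib

from transformers import AutoTokenizer

# Render the prompt the Transformers way (path is a placeholder for the fine-tuned checkpoint).
tokenizer = AutoTokenizer.from_pretrained("path/to/qwen3-0.6b-ft")
hf_prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)

# Prompt captured from the server log (the "completion request" entry with OLLAMA_DEBUG=2).
with open("ollama_prompt.txt") as f:
    ollama_prompt = f.read()

# Any remaining diff lines show where the Modelfile TEMPLATE and the Jinja chat template disagree.
for line in difflib.unified_diff(hf_prompt.splitlines(), ollama_prompt.splitlines(), lineterm=""):
    print(line)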

<!-- gh-comment-id:3563935019 --> @rick-github commented on GitHub (Nov 21, 2025): > 1. If I specify Qwen3's official TEMPLATE in the Modelfile, can the prompt generated from raw messages be functionally equivalent to the prompt produced by `qwen3_tokenizer.apply_chat_template`? They should be functionally the same. There are differences in formatting and think handling. For example, a prompt of "hello" and passing a tool to the Jinja template generates the following: ``` <|im_start|>system # Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> {"function": {"description": "Raises x to the y power and returns the result", "name": "power", "parameters": {"properties": {"x": {"description": "", "type": "number"}, "y": {"description": "", "type": "number"}}, "required": ["x", "y"], "type": "object"}}, "type": "function"} </tools> For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call> {"name": <function-name>, "arguments": <args-json-object>} </tool_call><|im_end|> <|im_start|>user hello<|im_end|> <|im_start|>assistant ``` The same prompt and tool passed through ollama generates the following: ``` <|im_start|>system # Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> {"type": "function", "function": {"name":"power","description":"Raises x to the y power and returns the result","parameters":{"type":"object","required":["x","y"],"properties":{"x":{"type":"number"},"y":{"type":"number"}}}}} </tools> For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call> {"name": <function-name>, "arguments": <args-json-object>} </tool_call><|im_end|> <|im_start|>user hello /think<|im_end|> <|im_start|>assistant ``` These are the differences: ```diff @@ -1,11 +1,13 @@ <|im_start|>system + + # Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> -{"function": {"description": "Raises x to the y power and returns the result", "name": "power", "parameters": {"properties": {"x": {"description": "", "type": "number"}, "y": {"description": "", "type": "number"}}, "required": ["x", "y"], "type": "object"}}, "type": "function"} +{"type": "function", "function": {"name":"power","description":"Raises x to the y power and returns the result","parameters":{"type":"object","required":["x","y"],"properties":{"x":{"type":"number"},"y":{"type":"number"}}}}} </tools> For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: @@ -13,5 +15,5 @@ {"name": <function-name>, "arguments": <args-json-object>} </tool_call><|im_end|> <|im_start|>user -hello<|im_end|> +hello /think<|im_end|> <|im_start|>assistant ``` The difference in the tool definition (other than ordering) is the argument description. Depending on the definition of your tools, this may be different between the two invocations. ```diff @@ -5,11 +5,9 @@ "parameters": { "properties": { "x": { - "description": "", "type": "number" }, "y": { - "description": "", "type": "number" } }, ``` > 2. Or is there a way to print out the prompt after TEMPLATE concatenation each time the model is called? 
Add `OLLAMA_DEBUG=2` to the server environment and look for `"completion request" prompt` in the log.