[GH-ISSUE #586] /api/generate with fixed seed and temperature=0 doesn't produce deterministic results #46774

Closed
opened 2026-04-27 23:57:00 -05:00 by GiteaMirror · 44 comments

Originally created by @jmorganca on GitHub (Sep 25, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/586

Originally assigned to: @BruceMacD on GitHub.

GiteaMirror added the bug label 2026-04-27 23:57:00 -05:00

@sqs commented on GitHub (Sep 27, 2023):

I just noticed this as well.

~3 weeks ago, the following command was deterministic:

```
curl -d '{"prompt":"const primes=[1,2,3,","model":"codellama:7b-code","options":{"seed":1337,"temperature":0,"num_ctx":100,"stop":["\n"]}}' http://localhost:11434/api/generate
```

Now it is not.
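
A minimal sketch (my addition, not from the thread) of a determinism check for this request: send the identical payload several times and count the distinct responses; a deterministic server yields exactly one. It assumes a local Ollama server on the default port:

```python
import requests

URL = "http://localhost:11434/api/generate"
payload = {
    "prompt": "const primes=[1,2,3,",
    "model": "codellama:7b-code",
    "stream": False,
    "options": {"seed": 1337, "temperature": 0, "num_ctx": 100, "stop": ["\n"]},
}

# Repeat the same request and collect the distinct completions.
responses = set()
for _ in range(10):
    r = requests.post(URL, json=payload)
    r.raise_for_status()
    responses.add(r.json()["response"])

print(f"{len(responses)} distinct response(s) across 10 runs")
```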

@BruceMacD commented on GitHub (Oct 2, 2023):

Fixed in #663

@j2l commented on GitHub (May 23, 2024):

It happens again.

@d-kleine commented on GitHub (Jun 26, 2024):

I just resolved my issue with the [Ollama API docs](https://github.com/ollama/ollama/blob/main/docs/api.md#chat-request-reproducible-outputs). The [model parameters](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values) that make the model deterministic (and thereby reproducible) need to be passed under an `"options"` key in the JSON input for Ollama:

```python
"options": {
    "seed": 42,       # fixed seed for reproducibility (not needed when using temperature=0)
    "temperature": 0, # temperature set to zero for determinism
}
```
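
For illustration, here is a sketch (not from the original comment) of how that options block sits in a full request, following the reproducible-outputs example the API docs describe; the model name is a placeholder:

```python
import requests

# Hypothetical /api/chat request carrying the options block from above.
payload = {
    "model": "llama3.1",  # placeholder model name
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": False,
    "options": {"seed": 42, "temperature": 0},
}

r = requests.post("http://localhost:11434/api/chat", json=payload)
r.raise_for_status()
print(r.json()["message"]["content"])
```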

@j2l commented on GitHub (Jun 27, 2024):

Hey @d-kleine, do you mean you just tested it and it absolutely works for you?

Because as you can see in the message from Sep 27, 2023, we all use seed + temperature + num_ctx in options:

> `"options":{"seed":1337,"temperature":0,"num_ctx":100 ...`

@d-kleine commented on GitHub (Jun 27, 2024):

Yes, the output is deterministic and reproducible - on the same device with the same OS (in my case, Windows 10). However, if you use a different OS (I tested this with Docker running an Ubuntu image on the same device), it will generate a similar but not identical output. So the output is **deterministic** and **reproducible** on the same OS, but I currently have the issue of producing **consistent** output across different OSes.

@j2l commented on GitHub (Jun 27, 2024):

Ok, thank you @d-kleine !
I use it on docker on my ubuntu host, maybe that's why.

@d-kleine commented on GitHub (Jun 27, 2024):

> I use it on docker on my ubuntu host, maybe that's why.

What OS do you use in your Docker image (not Ubuntu too, I assume)?

What I wanted to say is that when you switch the OS running the same code, you will get slightly different generated output; it's inconsistent across different OSes.

@d-kleine commented on GitHub (Jun 27, 2024):

It seems that even with the same model params (same prompt, same model, same options such as a fixed `seed` and `temperature` set to 0), the first generated output differs from the ones after it (the second generated output is consistent with all following generated outputs).
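
A sketch (my addition) of one way to probe this first-response drift, on the assumption that it comes from prompt-cache warm-up: unload the model after every request via the documented `keep_alive` parameter so each run starts cold, then compare the outputs:

```python
import requests

URL = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",  # placeholder model
    "prompt": "Why is the sky blue?",
    "stream": False,
    "keep_alive": 0,  # unload the model immediately after responding
    "options": {"seed": 42, "temperature": 0},
}

# With every request starting from a cold model, the first-run/later-run
# asymmetry described above should disappear if caching is the cause.
outputs = [requests.post(URL, json=payload).json()["response"] for _ in range(5)]
print(len(set(outputs)), "distinct output(s) across 5 cold runs")
```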

@Nayar commented on GitHub (Sep 9, 2024):

```python
data = {
    "model": model,
    "raw": False,
    "options": {
        "num_ctx": 1024 * 8,
        "temperature": 0,
        "seed": 42
    },
    "prompt": prompt_template % (in_params['context']['products'], prompt),
    "stream": False
}
```

I am getting different results on macOS (M3 Max, 30-core GPU) and Linux (NVIDIA 4070 Ti Super). Why is that so?

Both have the same version:

`ollama version is 0.3.9`

@d-kleine commented on GitHub (Sep 9, 2024):

@Nayar This is due to the model's architecture. So try a different model (e.g. gemma2 worked well for me across different OSes) or wait for these to be merged/resolved:
#4632
https://github.com/ggerganov/llama.cpp/issues/8353

@jtyska commented on GitHub (Nov 13, 2024):

Hey everyone, I have the opposite problem. With temperature 0, the generated content is exactly the same even if the seed is set differently. Model: qwen2.5:72b, using options: {"seed": 42 or 43 or 44 (always the same response), "temperature": 0}. Does anyone else have this problem? Any clue on how to fix it?

<!-- gh-comment-id:2473872896 --> @jtyska commented on GitHub (Nov 13, 2024): Hey everyone, I have the opposite problem. With temperature 0, the generated content is exactly the same even if the seed is set differently. Model: qwen2.5:72b using options:{"seed":42 or 43 or 44 (always same response), temperature:0}. Does someone have this problem? Any clue on how to fix it?
Author
Owner

@d-kleine commented on GitHub (Nov 13, 2024):

@jtyska Because seed and temp=0 are for making the output reproducible. If you want to generate variable output each time you execute the generation process, don't use any seed and increase the temperature to a value >0 and <=2. You could try 0.3 or 0.7 first to see if this fits your requirements.

@jtyska commented on GitHub (Nov 13, 2024):

Thanks for your reply @d-kleine.

I want it to be reproducible per seed value (this is usually how random seeds work, right?). In other words, for the same seed, I want the model to generate the same response, but for different seeds, different responses.
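
For that per-seed behavior, sampling has to be active; here is a sketch (my own, with a placeholder model) of the configuration this implies, using a temperature above zero so the seed actually steers the sampler:

```python
import requests

URL = "http://localhost:11434/api/generate"

def generate(seed: int) -> str:
    # temperature > 0 enables sampling; a fixed seed should then make the
    # sampled output repeatable for that seed (hardware effects aside).
    payload = {
        "model": "qwen2.5:1.5b",  # placeholder model
        "prompt": "Name one interesting fact about llamas.",
        "stream": False,
        "options": {"seed": seed, "temperature": 0.8},
    }
    return requests.post(URL, json=payload).json()["response"]

print(generate(42) == generate(42))  # expected True: same seed
print(generate(42) == generate(43))  # typically False: different seeds
```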

@jhpjhp1118 commented on GitHub (Nov 26, 2024):

I have one question.
I thought that `num_ctx` is fixed at the default value (2048) even without explicitly setting it.
Does `num_ctx` change on every single execution?

@d-kleine commented on GitHub (Nov 26, 2024):

> I thought that `num_ctx` is fixed as default value (=2048) even without explicitly setting this value as 2048.

Sorry, I have just revised my statement from above. You are right: every language model has a fixed, predefined context length, depending on the model itself (I always look it up for each model on HF). So no, `num_ctx` does not change on every single execution when you use the same model (unless explicitly modified by the user or system).

@jhpjhp1118 commented on GitHub (Nov 26, 2024):

@d-kleine Thanks for your response.
Then I have another question.
I expected that a deterministic response could be obtained by simply setting the `temperature` to 0.
However, in my experience, just setting the temperature to 0 gave slightly different responses,
and a deterministic response could be obtained only when also setting `num_ctx` to an arbitrary fixed value. Why do I have to set `num_ctx` directly to get a consistent response, in addition to `temperature`?
(In my runs, this mainly occurred when `num_ctx` was shorter than the token length of the prompt.)
If there is something I am mistaken about, please correct me.

@d-kleine commented on GitHub (Nov 26, 2024):

So the idea of setting temp=0 is to make a language model deterministic: it always picks the highest-probability token, ensuring the output is the same for a given input regardless of the seed (though it can still vary with hardware differences, floating-point precision errors, and multithreading; see the linked issues above).

About `num_ctx`, you don't have to set it at all - the maximum predefined context length is provided by the model. Sometimes it's helpful to shorten it to reduce resources and inference time. But if part of your prompt is cut off due to a short `num_ctx`, the model's understanding of the task or context changes, leading to variations in output even if other parameters like temperature are fixed at zero. Therefore, always ensure that `num_ctx` is large enough to fit your entire prompt without truncation.
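
A sketch (my addition) of one way to detect such truncation: the final `/api/generate` response includes a `prompt_eval_count` field, which can be compared against the configured `num_ctx`:

```python
import requests

URL = "http://localhost:11434/api/generate"
NUM_CTX = 512  # deliberately small window for this test

payload = {
    "model": "llama3.1",  # placeholder model
    "prompt": "some long prompt " * 200,
    "stream": False,
    "options": {"temperature": 0, "num_ctx": NUM_CTX},
}

resp = requests.post(URL, json=payload).json()
# prompt_eval_count is the number of prompt tokens actually evaluated;
# if it sits at or near num_ctx, the prompt was likely truncated.
print("prompt tokens evaluated:", resp["prompt_eval_count"], "/ num_ctx:", NUM_CTX)
```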

@jhpjhp1118 commented on GitHub (Nov 27, 2024):

I don't understand this comment yet.

> But if a part of your prompt is cut off due to a short num_ctx, the model's understanding of the task or context changes, leading to variations in output, even if other parameters like temperature are fixed to zero.

After fixing the `temperature` at 0 and experimenting further, the output is deterministic when `num_ctx` is longer than the prompt length; when it is shorter, the response was a little inconsistent.
Even when `num_ctx` is shorter than the prompt length, I assumed that if the truncation method was consistent, the response would be consistent, because the truncated input would be consistent. In fact, even with `num_ctx` shorter than the prompt length, most of the responses are consistent, but sometimes there are inconsistent cases (below 10% probability). Why is this happening?

@d-kleine commented on GitHub (Nov 27, 2024):

Please provide sample code, especially the input, the settings, and the model you are using.

> After fixing the `temperature` to 0 and proceeding with the experiment further, it is deterministic when `num_ctx` is longer than prompt length, but vice versa, the response was a little inconsistent.

First of all, remember that most language models use subword tokenization techniques. You can test this out, for example, [with the tokenizers OpenAI uses for their models](https://platform.openai.com/tokenizer), but there are other tokenizers, each working differently. Therefore, you need to check which tokenizer your model uses if you want to reduce the context length. And please remember: **`num_ctx` covers the tokenized input plus the tokenized output**, not the tokenized input only!

About the inconsistency of the output despite `temperature` set to 0: this can happen for many reasons, for example because the value is actually a very small number close to zero (to avoid division by zero), sampling techniques, etc. Therefore it's important to know the model you are using and its underlying architecture and the techniques used in it.

To your question:

> Why is this happening?

If important parts of your input are cut off, the model may receive incomplete context and therefore produce inconsistent output. Reasons for that - even with temperature 0, where the model deterministically selects the highest-probability token - can be ties in token probabilities, training biases, positional sensitivity (e.g., primacy/recency effects), etc.

@jhpjhp1118 commented on GitHub (Nov 28, 2024):

This is my code. I use the `llama3.1:8b` model.

```python
import requests

endpoint = "http://localhost:11434"  # the Ollama server base URL (defined outside the original snippet)

TEMPERATURE_DETERMINISTIC = 0.0

def generate_response(message, model="llama3.1", stream=False):
    url = endpoint + "/api/generate"
    headers = {
        'Content-Type': 'application/json'
    }
    data = {
        "model": model,
        "prompt": message,
        "options": {
            "temperature": TEMPERATURE_DETERMINISTIC,
            "num_ctx": 3,  # experimented with 3 or 100
        },
        "stream": stream
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code != 200:
        print("Error: Could not generate response. Status code: ", response.status_code)
    return response

prompt = "Hello, Mr.llama. How are you today?"
for _ in range(30):
    response_raw = generate_response(prompt)  # originally called via pkg.Llama.generate_response
    response = response_raw.json()['response']
    print(response_raw.json())
```

There are 2 points that I want to make, after some more experiments.

  • I think I found the reason for the inconsistent responses, at least in my case. I was using Kubernetes to run ollama as a server, and I found that inconsistent responses occurred mainly when the pod changed during repeated executions with the same prompt. Is it possible that a pod change causes inconsistent responses?
  • Even with a context length (`num_ctx`) shorter than the prompt, the responses were consistent as long as the pod did not change. Therefore, I still think that even if `num_ctx` is shorter than the prompt length, responses will be consistent if `temperature` is 0. This means that `num_ctx` does not affect the consistency of responses as long as it is fixed.

@d-kleine commented on GitHub (Nov 28, 2024):

My two cents on that:

  • `num_ctx` is way too short; maybe increase it to 8192 (Llama 3's original context length) or something like that if you want to generate at least somewhat meaningful output for your input. Llama 3.1 8B supports 128k (actually ~131k) tokens.
  • I don't have deep knowledge of Kubernetes, but afaik a pod is like an instance. So when restarting/changing the pod, the pod probably has a different state or cache, as long as you don't run them with the same state (see https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/). So yes, this can be the root of why your output varies.

@hissain commented on GitHub (Dec 4, 2024):

Worked for me:

```python
import json

import numpy as np   # import was missing in the original snippet
import requests
import torch         # import was missing in the original snippet

ollama_url_gen = "http://localhost:11434/api/generate"
ollama_model_name = "llama3.2:latest"

# Note: these calls seed only the local Python process; they do not
# influence sampling inside the Ollama server.
torch.manual_seed(42)
np.random.seed(42)

def generate_answer(question, model=ollama_model_name):
    options = {"temperature": 0.0, "num_ctx": 4096}
    payload = {"model": model, "prompt": question, "options": options, "stream": False}
    headers = {"Content-Type": "application/json"}
    response = requests.post(ollama_url_gen, headers=headers, data=json.dumps(payload))

    if response.status_code != 200:
        raise Exception(f"Error from Ollama: {response.text}")

    return response.json().get("response", "")
```
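
A quick usage check (my addition, not part of the original comment) that calls the helper twice with the same question and compares the results:

```python
# Smoke test for generate_answer: identical inputs should yield identical
# outputs if the setup above really is deterministic.
a = generate_answer("Why is the sky blue?")
b = generate_answer("Why is the sky blue?")
print("identical:", a == b)
```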

@jtyska commented on GitHub (Jan 6, 2025):

I'm back here. I'm still getting some unexpected results with the temperature parameter. Please take a look at these 10 calls to api/generate with the exact same prompt, temperature=0, and varying seeds (which shouldn't do anything, since the output should be deterministic, right?). The first part is the counter of the responses, and the second is the exact payload and received response for each seed. If you make parallel requests with varying temperatures (10 seeds for each), for instance temperature=0 and temperature=10, this problem worsens, and temperature 0 generates several different responses to the same prompt. Could you test that and see if you can reproduce it? I suspect it is a real bug.

**Response counter for temperature 0 (same model, same prompt, different seeds)**

{ "Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist with tasks such as writing essays, generating creative ideas, or even helping you learn new languages. Just let me know what you need assistance with!": 1,
"Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.\n\nIf you need any specific assistance, feel free to ask!": 9}

**Payload/response details for each seed (the first response differs from the others)**

Starting tests with temperature: 0...

Seed 0
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 0,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist with tasks such as writing essays, generating creative ideas, or even helping you learn new languages. Just let me know what you need assistance with!

############
Seed 1
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 1,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 2
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 2,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 3
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 3,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 4
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 4,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 5
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 5,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 6
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 6,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 7
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 7,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 8
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 8,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 9
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 9,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############

...done with temperature: 0.

--
Response counter for parallel requests temperature=0 and temperature=10
Temperature = 0 - Count number by unique response
[
7,
2,
1
]
Temperature = 10 - Count number by unique response
[
1,
1,
1,
1,
1,
1,
1,
1,
1,
1
]

Response counter for parallel requests of 8 different temperatures
Temperature = 0 - Count number by unique response
[
4,
1,
1,
1,
2,
1
]
*all other temperatures generated 10 different responses

@jessegross commented on GitHub (Jan 6, 2025):

Yes, there is known non-determinism as a result of prompt caching (which affects the first response) and parallelism. It's because floating point does not give exactly the same results for different combinations of operations that are mathematically equal. The effect is more pronounced on some hardware - for example, Nvidia GPUs show it more than Apple.
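
A tiny illustration (my addition) of the floating-point effect described here: addition is not associative, so mathematically equal sums computed in different orders (as happens with different GPU reduction schedules) can differ, which is enough to flip a near-tie between tokens:

```python
# Floating-point addition is not associative: regrouping the same
# three terms changes the result in the last bit.
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
print(x == y)  # False
print(x, y)    # 0.6000000000000001 0.6
```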

@jtyska commented on GitHub (Jan 6, 2025):

Actually, the parallelism is in the processes that are making the API requests; the Ollama server is answering them sequentially. Each time I run multiple API requests against the same model with different seeds/temperatures (0 and some other value), I get completely random responses for temperature 0. This isn't supposed to happen, right?

About the prompt caching, is it possible to disable it?

@d-kleine commented on GitHub (Jan 7, 2025):

> About the prompt caching, is it possible to disable it?

Not yet, see #5760

I think it would also be good to have these settings mentioned here: https://github.com/ggerganov/llama.cpp/issues/8353
These would make generated outputs truly deterministic.

@d-kleine commented on GitHub (Jan 7, 2025):

@jtyska I just saw this discussion: https://github.com/ggerganov/llama.cpp/discussions/3005#discussioncomment-11151329

Have you tried this?


Edit: @jtyska I just saw that you have asked in the linked thread. Afaik there is no direct `--sampling-seq` option in ollama, but as I understand this param from its [documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md) (see under "Sampling params"), it decides the order in which the samplers are run.

So, have you tried just using `"top_k": 1` to get consistently reproducible outputs?
Please also see [ollama's param doc](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter) for reference (just to show where I got the param from; its description is actually not quite precise: `top_k` considers the k most probable tokens at each generation step, so `top_k=1` will always select the most probable token at each step of the generation process).
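
A sketch (my own) of the greedy-decoding configuration suggested here, with a placeholder model name:

```python
import requests

# top_k=1 restricts sampling to the single most probable token at each
# step, i.e. greedy decoding, independent of seed and temperature.
payload = {
    "model": "qwen2.5:1.5b",  # placeholder model
    "prompt": "Hi! How are you?",
    "stream": False,
    "options": {"temperature": 0, "seed": 42, "top_k": 1},
}

r = requests.post("http://localhost:11434/api/generate", json=payload)
print(r.json()["response"])
```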

@d-kleine commented on GitHub (Jan 14, 2025):

@jtyska Did this resolve your issue or do you need further help with this?

@jtyska commented on GitHub (Jan 14, 2025):

Hey @d-kleine, I'm running multiple experiments with temperature/seed values and different models, but I'm not confident that the behavior is consistent. I'm waiting for my experiments to finish, and then I will analyze their outputs to check whether they are consistent. The workaround I'm trying, to avoid prompt caching and any other parallel interference, is not having the same model answer multiple parallel requests. Also, when a specific experiment with a seed/temperature pair ends, I unload the model (keep_alive=0) and load it again with the new parameters. Let's see.

I will also try to figure out how to use top_k within the API when I have time. I'll let you know how it goes.

@huynhducloi00 commented on GitHub (Feb 14, 2025):

This is a pretty bad issue. It basically means the platform answers differently to the same prompt. Imagine this being used in a yes/no setup: an answer of 'yes' is 180 degrees away from 'no'.

Without this being fixed, I don't see how Ollama can be used.

@d-kleine commented on GitHub (Feb 14, 2025):

> I don't set the `temperature`, `top_p`, or `top_k`. Setting those three to `0` will make the seed redundant

That's not true - the seed still influences the initial probability distribution affecting the model's token selection process, even for a `top_p` (limiting the cumulative probability distribution) near 0 or a low `top_k` (restricting the number of candidate tokens). Also, setting those parameters to 0 (except `temperature`, but even then this will be close to 0) doesn't make sense imho.

@d-kleine commented on GitHub (Feb 14, 2025):

No problem - just to be precise, setting top_p to zero is technically fine when temperature is zero since this overrides the nucleus sampling anyway (adjust either temperature or top_p, but not both simultaneously). But top_k must remain ≥1 to allow token selection.

@huynhducloi00 commented on GitHub (Feb 14, 2025):

The issue still happens with top_k=1, seed=42, temperature=0, and top_p at its default (don't know what that is). It does not make any sense: I ask it the same prompt; the first time it gives the answer 'yes', the second time 'no'. So should I report this answer as yes or no in the paper? lol. This is really of no use.

@d-kleine commented on GitHub (Feb 14, 2025):

@huynhducloi00 Please provide the model (and code, if possible)

@huynhducloi00 commented on GitHub (Feb 14, 2025):

It can be reproduced with the model 'deepseek-r1:14b':

```python
from ollama import generate

MODEL = 'deepseek-r1:14b'

def get_ollama_order_flaky(prompt):
    response = generate(MODEL, prompt, options={'temperature': 0, 'seed': 42, 'top_k': 1})
    return response['response']

prompt = """
download from here https://justpaste.it/gk2xt
"""
```

It sometimes gives the answer yes, sometimes no, especially when you flip in a different prompt in between.
@d-kleine commented on GitHub (Feb 14, 2025):

@huynhducloi00 Please try with `options={"temperature": 0, "seed": 42, "top_k": 1, "top_p": 1}`; at least on my end (GPU) it's consistently reproducible across multiple runs (with r1 1.5b). Please let me know if that suits your requirements.

Edit:
The prompt should be further engineered, e.g. making it more precise that it's a question and what the answer format should look like, e.g.

```text
can `testCompositeKeys` be flaky depending on the order of which it is run compared to other tests? Answer with "Yes" or "No" only, nothing more.*
```

if you want to output only "Yes" or "No". For example, I have added a `?` (indicating a question) and specified the desired output format a little more. This makes the input clearer for the model, which has implications for the token selection process in the output.

@huynhducloi00 commented on GitHub (Feb 14, 2025):

Thanks a lot for the response. That helps a lot

@huynhducloi00 commented on GitHub (Feb 15, 2025):

Sadly, `top_p` does not solve it. Here is the result of 10 questions:

```
Run 1:
[(5, False),
 (6, True),
 (7, False),
 (8, True),
 (9, True),
 (10, True),
 (11, True),
 (12, True),
 (13, False),
 (14, False)]

Run 2:
[(5, False),
 (6, True),
 (7, True),
 (8, True),
 (9, False),
 (10, True),
 (11, True),
 (12, True),
 (13, False),
 (14, True)]
```

They are off (different) starting at index 7.

  • Here is the answer at run 1: https://justpaste.it/dc73s
  • Here is the answer at run 2: https://justpaste.it/3t6y9

The model being tested is deepseek-r1:14b. The options are `"temperature": 0, "seed": 42, "top_k": 1, "top_p": 1, "num_ctx": 10000`.

I am not sure why this is not being prioritized; it is very easy to reproduce. Is it due to randomness in the quantization? I guess deepseek-r1:14b is a quantized model.

@d-kleine commented on GitHub (Feb 16, 2025):

Have you tried without limiting the context length (`num_ctx`)?

There are numerous other reasons that can introduce randomness, e.g.

  • PRNG initialization between runs
  • parallel GPU computation (nothing can be done about this; you can double-check whether the behaviour is the same on CPU only, see the sketch below)
  • the quantization method (Q4_K_M)
  • floating-point arithmetic (fp32, fp16, bfloat16)
  • etc. pp.
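
A sketch (my addition, not from the thread) of the CPU-only cross-check mentioned in the list above: Ollama's documented `num_gpu` option controls how many layers are offloaded to the GPU, so setting it to 0 should force CPU inference for the comparison:

```python
import requests

# Offload zero layers to the GPU to force CPU-only inference, then repeat
# the request and see whether the nondeterminism persists.
payload = {
    "model": "deepseek-r1:14b",
    "prompt": "Answer with Yes or No: is 17 a prime number?",
    "stream": False,
    "options": {"temperature": 0, "seed": 42, "top_k": 1, "num_gpu": 0},
}

outs = set()
for _ in range(5):
    outs.add(requests.post("http://localhost:11434/api/generate", json=payload).json()["response"])

print(len(outs), "distinct response(s) on CPU")
```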

@dan31 commented on GitHub (Mar 26, 2025):

Wow, this is disastrous. We encounter this a lot now. What is the true reason quantized models cannot produce deterministic outputs when all non-determinism is turned off in ollama? For us this is completely serial execution on a GPU by a single client, with a Q4_K_M model, the most popular quantization format.

@kevin-pw commented on GitHub (Mar 26, 2025):

> model being test is deepseek-r1:14b. The option is `"temperature": 0, "seed": 42, "top_k": 1, 'top_p':1,'num_ctx':10000`

I wrote a longer comment in the related (and still open) issue here: https://github.com/ollama/ollama/issues/5321#issuecomment-2755465128, but the main takeaway is: several models also produce inconsistent embeddings for the same inputs, which can significantly affect the quality of downstream applications, for example RAG. That also means sampler options (like `temperature`, `seed`, `top_k`, etc.) cannot fix this issue.
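
A sketch (mine) of how that embedding claim can be checked: request the same input twice from the `/api/embed` endpoint (available in recent Ollama versions) and compare the vectors; the model name is a placeholder:

```python
import math

import requests

URL = "http://localhost:11434/api/embed"
payload = {"model": "nomic-embed-text", "input": "the same sentence twice"}  # placeholder model

a = requests.post(URL, json=payload).json()["embeddings"][0]
b = requests.post(URL, json=payload).json()["embeddings"][0]

# Deterministic embeddings should match exactly (cosine similarity 1.0).
dot = sum(x * y for x, y in zip(a, b))
cos = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
print("identical:", a == b, "| cosine:", cos)
```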

@flexorx commented on GitHub (Mar 26, 2025):

@kevin-pw do you know if this issue was recently introduced?
I couldn't find any mentions of a similar issue until recently.

@huynhducloi00 commented on GitHub (May 9, 2025):

Yes, not sure why the team does not prioritize this.

Reference: github-starred/ollama#46774