[GH-ISSUE #586] /api/generate with fixed seed and temperature=0 doesn't produce deterministic results #46774

Closed
opened 2026-04-27 23:57:00 -05:00 by GiteaMirror · 44 comments

Originally created by @jmorganca on GitHub (Sep 25, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/586

Originally assigned to: @BruceMacD on GitHub.

GiteaMirror added the bug label 2026-04-27 23:57:00 -05:00

@sqs commented on GitHub (Sep 27, 2023):

I just noticed this as well.

~3 weeks ago, the following command was deterministic:

```
curl -d '{"prompt":"const primes=[1,2,3,","model":"codellama:7b-code","options":{"seed":1337,"temperature":0,"num_ctx":100,"stop":["\n"]}}' http://localhost:11434/api/generate
```

Now it is not.
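
A minimal sketch (my addition, not from the thread) of a determinism check for this request: send the identical payload several times and count the distinct responses; a deterministic server yields exactly one. It assumes a local Ollama server on the default port:

```python
import requests

URL = "http://localhost:11434/api/generate"
payload = {
    "prompt": "const primes=[1,2,3,",
    "model": "codellama:7b-code",
    "stream": False,
    "options": {"seed": 1337, "temperature": 0, "num_ctx": 100, "stop": ["\n"]},
}

# Repeat the same request and collect the distinct completions.
responses = set()
for _ in range(10):
    r = requests.post(URL, json=payload)
    r.raise_for_status()
    responses.add(r.json()["response"])

print(f"{len(responses)} distinct response(s) across 10 runs")
```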

@BruceMacD commented on GitHub (Oct 2, 2023):

Fixed in #663

@j2l commented on GitHub (May 23, 2024):

It happens again.

@d-kleine commented on GitHub (Jun 26, 2024):

I just resolved my issue with the [Ollama API docs](https://github.com/ollama/ollama/blob/main/docs/api.md#chat-request-reproducible-outputs). The [model parameters](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values) that make the model deterministic (and thereby reproducible) need to be passed under an `"options"` key in the JSON input for Ollama:

```python
"options": {
    "seed": 42,       # fixed seed for reproducibility (not needed when using temperature=0)
    "temperature": 0, # temperature set to zero for determinism
}
```
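
For illustration, here is a sketch (not from the original comment) of how that options block sits in a full request, following the reproducible-outputs example the API docs describe; the model name is a placeholder:

```python
import requests

# Hypothetical /api/chat request carrying the options block from above.
payload = {
    "model": "llama3.1",  # placeholder model name
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": False,
    "options": {"seed": 42, "temperature": 0},
}

r = requests.post("http://localhost:11434/api/chat", json=payload)
r.raise_for_status()
print(r.json()["message"]["content"])
```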

@j2l commented on GitHub (Jun 27, 2024):

Hey @d-kleine, do you mean you just tested it and it absolutely works for you?

Because as you can see in the message from Sep 27, 2023, we all use seed + temperature + num_ctx in options:

> `"options":{"seed":1337,"temperature":0,"num_ctx":100 ...`

@d-kleine commented on GitHub (Jun 27, 2024):

Yes, the output is deterministic and reproducible - on the same device with the same OS (in my case, Windows 10). However, if you use a different OS (I tested this with Docker running an Ubuntu image on the same device), it will generate a similar but not identical output. So the output is **deterministic** and **reproducible** on the same OS, but I currently have the issue of producing **consistent** output across different OSes.

@j2l commented on GitHub (Jun 27, 2024):

Ok, thank you @d-kleine !
I use it on docker on my ubuntu host, maybe that's why.

@d-kleine commented on GitHub (Jun 27, 2024):

> I use it on docker on my ubuntu host, maybe that's why.

What OS do you use in your Docker image (not Ubuntu too, I assume)?

What I wanted to say is that when you switch the OS running the same code, you will get slightly different generated output; it's inconsistent across different OSes.

@d-kleine commented on GitHub (Jun 27, 2024):

It seems that even with the same model params (same prompt, same model, same options such as a fixed `seed` and `temperature` set to 0), the first generated output differs from the ones after it (the second generated output is consistent with all following generated outputs).
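
A sketch (my addition) of one way to probe this first-response drift, on the assumption that it comes from prompt-cache warm-up: unload the model after every request via the documented `keep_alive` parameter so each run starts cold, then compare the outputs:

```python
import requests

URL = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",  # placeholder model
    "prompt": "Why is the sky blue?",
    "stream": False,
    "keep_alive": 0,  # unload the model immediately after responding
    "options": {"seed": 42, "temperature": 0},
}

# With every request starting from a cold model, the first-run/later-run
# asymmetry described above should disappear if caching is the cause.
outputs = [requests.post(URL, json=payload).json()["response"] for _ in range(5)]
print(len(set(outputs)), "distinct output(s) across 5 cold runs")
```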

@Nayar commented on GitHub (Sep 9, 2024):

```python
data = {
    "model": model,
    "raw": False,
    "options": {
        "num_ctx": 1024 * 8,
        "temperature": 0,
        "seed": 42
    },
    "prompt": prompt_template % (in_params['context']['products'], prompt),
    "stream": False
}
```

I am getting different results on macOS (M3 Max, 30-core GPU) and Linux (NVIDIA 4070 Ti Super). Why is that so?

Both have the same version:

`ollama version is 0.3.9`

@d-kleine commented on GitHub (Sep 9, 2024):

@Nayar This is due to the model's architecture. So try a different model (e.g. gemma2 worked well for me across different OSes) or wait for these to be merged/resolved:
#4632
https://github.com/ggerganov/llama.cpp/issues/8353

@jtyska commented on GitHub (Nov 13, 2024):

Hey everyone, I have the opposite problem. With temperature 0, the generated content is exactly the same even if the seed is set differently. Model: qwen2.5:72b, using options: {"seed": 42 or 43 or 44 (always the same response), "temperature": 0}. Does anyone else have this problem? Any clue on how to fix it?

<!-- gh-comment-id:2473872896 --> @jtyska commented on GitHub (Nov 13, 2024): Hey everyone, I have the opposite problem. With temperature 0, the generated content is exactly the same even if the seed is set differently. Model: qwen2.5:72b using options:{"seed":42 or 43 or 44 (always same response), temperature:0}. Does someone have this problem? Any clue on how to fix it?
Author
Owner

@d-kleine commented on GitHub (Nov 13, 2024):

@jtyska Because seed and temp=0 are for making the output reproducible. If you want to generate variable output each time you execute the generation process, don't use any seed and increase the temperature to a value >0 and <=2. You could try 0.3 or 0.7 first to see if this fits your requirements.

@jtyska commented on GitHub (Nov 13, 2024):

Thanks for your reply @d-kleine.

I want it to be reproducible per seed value (this is usually how random seeds work, right?). In other words, for the same seed, I want the model to generate the same response, but for different seeds, different responses.
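
For that per-seed behavior, sampling has to be active; here is a sketch (my own, with a placeholder model) of the configuration this implies, using a temperature above zero so the seed actually steers the sampler:

```python
import requests

URL = "http://localhost:11434/api/generate"

def generate(seed: int) -> str:
    # temperature > 0 enables sampling; a fixed seed should then make the
    # sampled output repeatable for that seed (hardware effects aside).
    payload = {
        "model": "qwen2.5:1.5b",  # placeholder model
        "prompt": "Name one interesting fact about llamas.",
        "stream": False,
        "options": {"seed": seed, "temperature": 0.8},
    }
    return requests.post(URL, json=payload).json()["response"]

print(generate(42) == generate(42))  # expected True: same seed
print(generate(42) == generate(43))  # typically False: different seeds
```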

@jhpjhp1118 commented on GitHub (Nov 26, 2024):

I have one question.
I thought that `num_ctx` is fixed at the default value (2048) even without explicitly setting it.
Does `num_ctx` change on every single execution?

@d-kleine commented on GitHub (Nov 26, 2024):

> I thought that `num_ctx` is fixed as default value (=2048) even without explicitly setting this value as 2048.

Sorry, I have just revised my statement from above. You are right: every language model has a fixed, predefined context length, depending on the model itself (I always look it up for each model on HF). So no, `num_ctx` does not change on every single execution when you use the same model (unless explicitly modified by the user or system).

@jhpjhp1118 commented on GitHub (Nov 26, 2024):

@d-kleine Thanks for your response.
Then I have another question.
I expected that a deterministic response could be obtained by simply setting the `temperature` to 0.
However, in my experience, just setting the temperature to 0 gave slightly different responses,
and a deterministic response could be obtained only when also setting `num_ctx` to an arbitrary fixed value. Why do I have to set `num_ctx` directly to get a consistent response, in addition to `temperature`?
(In my runs, this mainly occurred when `num_ctx` was shorter than the token length of the prompt.)
If there is something I am mistaken about, please correct me.

@d-kleine commented on GitHub (Nov 26, 2024):

So the idea of setting temp=0 is to make a language model deterministic: it always picks the highest-probability token, ensuring the output is the same for a given input regardless of the seed (though it can still vary with hardware differences, floating-point precision errors, and multithreading; see the linked issues above).

About `num_ctx`, you don't have to set it at all - the maximum predefined context length is provided by the model. Sometimes it's helpful to shorten it to reduce resources and inference time. But if part of your prompt is cut off due to a short `num_ctx`, the model's understanding of the task or context changes, leading to variations in output even if other parameters like temperature are fixed at zero. Therefore, always ensure that `num_ctx` is large enough to fit your entire prompt without truncation.
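
A sketch (my addition) of one way to detect such truncation: the final `/api/generate` response includes a `prompt_eval_count` field, which can be compared against the configured `num_ctx`:

```python
import requests

URL = "http://localhost:11434/api/generate"
NUM_CTX = 512  # deliberately small window for this test

payload = {
    "model": "llama3.1",  # placeholder model
    "prompt": "some long prompt " * 200,
    "stream": False,
    "options": {"temperature": 0, "num_ctx": NUM_CTX},
}

resp = requests.post(URL, json=payload).json()
# prompt_eval_count is the number of prompt tokens actually evaluated;
# if it sits at or near num_ctx, the prompt was likely truncated.
print("prompt tokens evaluated:", resp["prompt_eval_count"], "/ num_ctx:", NUM_CTX)
```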

@jhpjhp1118 commented on GitHub (Nov 27, 2024):

I don't understand this comment yet.

> But if a part of your prompt is cut off due to a short num_ctx, the model's understanding of the task or context changes, leading to variations in output, even if other parameters like temperature are fixed to zero.

After fixing the `temperature` at 0 and experimenting further, the output is deterministic when `num_ctx` is longer than the prompt length; when it is shorter, the response was a little inconsistent.
Even when `num_ctx` is shorter than the prompt length, I assumed that if the truncation method was consistent, the response would be consistent, because the truncated input would be consistent. In fact, even with `num_ctx` shorter than the prompt length, most of the responses are consistent, but sometimes there are inconsistent cases (below 10% probability). Why is this happening?

@d-kleine commented on GitHub (Nov 27, 2024):

Please provide sample code, especially the input, the settings, and the model you are using.

> After fixing the `temperature` to 0 and proceeding with the experiment further, it is deterministic when `num_ctx` is longer than prompt length, but vice versa, the response was a little inconsistent.

First of all, remember that most language models use subword tokenization techniques. You can test this out, for example, [with the tokenizers OpenAI uses for their models](https://platform.openai.com/tokenizer), but there are other tokenizers, each working differently. Therefore, you need to check which tokenizer your model uses if you want to reduce the context length. And please remember: **`num_ctx` covers the tokenized input plus the tokenized output**, not the tokenized input only!

About the inconsistency of the output despite `temperature` set to 0: this can happen for many reasons, for example because the value is actually a very small number close to zero (to avoid division by zero), sampling techniques, etc. Therefore it's important to know the model you are using and its underlying architecture and the techniques used in it.

To your question:

> Why is this happening?

If important parts of your input are cut off, the model may receive incomplete context and therefore produce inconsistent output. Reasons for that - even with temperature 0, where the model deterministically selects the highest-probability token - can be ties in token probabilities, training biases, positional sensitivity (e.g., primacy/recency effects), etc.

@jhpjhp1118 commented on GitHub (Nov 28, 2024):

This is my code. I use the `llama3.1:8b` model.

```python
import requests

endpoint = "http://localhost:11434"  # the Ollama server base URL (defined outside the original snippet)

TEMPERATURE_DETERMINISTIC = 0.0

def generate_response(message, model="llama3.1", stream=False):
    url = endpoint + "/api/generate"
    headers = {
        'Content-Type': 'application/json'
    }
    data = {
        "model": model,
        "prompt": message,
        "options": {
            "temperature": TEMPERATURE_DETERMINISTIC,
            "num_ctx": 3,  # experimented with 3 or 100
        },
        "stream": stream
    }

    response = requests.post(url, headers=headers, json=data)

    if response.status_code != 200:
        print("Error: Could not generate response. Status code: ", response.status_code)
    return response

prompt = "Hello, Mr.llama. How are you today?"
for _ in range(30):
    response_raw = generate_response(prompt)  # originally called via pkg.Llama.generate_response
    response = response_raw.json()['response']
    print(response_raw.json())
```

There are 2 points that I want to make, after some more experiments.

  • I think I found the reason for the inconsistent responses, at least in my case. I was using Kubernetes to run ollama as a server, and I found that inconsistent responses occurred mainly when the pod changed during repeated executions with the same prompt. Is it possible that a pod change causes inconsistent responses?
  • Even with a context length (`num_ctx`) shorter than the prompt, the responses were consistent as long as the pod did not change. Therefore, I still think that even if `num_ctx` is shorter than the prompt length, responses will be consistent if `temperature` is 0. This means that `num_ctx` does not affect the consistency of responses as long as it is fixed.

@d-kleine commented on GitHub (Nov 28, 2024):

My two cents on that:

  • `num_ctx` is way too short; maybe increase it to 8192 (Llama 3's original context length) or something like that if you want to generate at least somewhat meaningful output for your input. Llama 3.1 8B supports 128k (actually ~131k) tokens.
  • I don't have deep knowledge of Kubernetes, but afaik a pod is like an instance. So when restarting/changing the pod, the pod probably has a different state or cache, as long as you don't run them with the same state (see https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/). So yes, this can be the root of why your output varies.

@hissain commented on GitHub (Dec 4, 2024):

Worked for me:

```python
import json

import numpy as np   # import was missing in the original snippet
import requests
import torch         # import was missing in the original snippet

ollama_url_gen = "http://localhost:11434/api/generate"
ollama_model_name = "llama3.2:latest"

# Note: these calls seed only the local Python process; they do not
# influence sampling inside the Ollama server.
torch.manual_seed(42)
np.random.seed(42)

def generate_answer(question, model=ollama_model_name):
    options = {"temperature": 0.0, "num_ctx": 4096}
    payload = {"model": model, "prompt": question, "options": options, "stream": False}
    headers = {"Content-Type": "application/json"}
    response = requests.post(ollama_url_gen, headers=headers, data=json.dumps(payload))

    if response.status_code != 200:
        raise Exception(f"Error from Ollama: {response.text}")

    return response.json().get("response", "")
```
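
A quick usage check (my addition, not part of the original comment) that calls the helper twice with the same question and compares the results:

```python
# Smoke test for generate_answer: identical inputs should yield identical
# outputs if the setup above really is deterministic.
a = generate_answer("Why is the sky blue?")
b = generate_answer("Why is the sky blue?")
print("identical:", a == b)
```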

@jtyska commented on GitHub (Jan 6, 2025):

I'm back here. I'm still getting some unexpected results with the temperature parameter. Please take a look at these 10 calls to api/generate with the exact same prompt, temperature=0, and varying seeds (which shouldn't do anything, since the output should be deterministic, right?). The first part is the counter of the responses, and the second is the exact payload and received response for each seed. If you make parallel requests with varying temperatures (10 seeds for each), for instance temperature=0 and temperature=10, this problem worsens, and temperature 0 generates several different responses to the same prompt. Could you test that and see if you can reproduce it? I suspect it is a real bug.

**Response counter for temperature 0 (same model, same prompt, different seeds)**

{ "Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist with tasks such as writing essays, generating creative ideas, or even helping you learn new languages. Just let me know what you need assistance with!": 1,
"Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.\n\nIf you need any specific assistance, feel free to ask!": 9}

**Payload/response details for each seed (the first response differs from the others)**

Starting tests with temperature: 0...

Seed 0
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 0,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist with tasks such as writing essays, generating creative ideas, or even helping you learn new languages. Just let me know what you need assistance with!

############
Seed 1
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 1,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 2
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 2,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 3
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 3,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 4
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 4,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 5
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 5,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 6
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 6,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 7
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 7,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 8
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 8,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############
Seed 9
PAYLOAD SENT to api/generate

{
"model": "qwen2.5:1.5b",
"stop": "<|endoftext|>",
"prompt": "Hi! How are you? I am testing the impact of different temperatures in the response generated by you, could you please generate any answer you with for this request?",
"options": {
"seed": 9,
"temperature": 0
},
"stream": false
}

RESPONSE GOT

Hello! I'm just an AI language model and don't have feelings or emotions like humans do. However, I can help you find information on a wide range of topics and assist you with tasks such as writing essays, creating presentations, answering questions about science, history, literature, etc.

If you need any specific assistance, feel free to ask!

############

...done with temperature: 0.

--
Response counter for parallel requests temperature=0 and temperature=10
Temperature = 0 - Count number by unique response
[
7,
2,
1
]
Temperature = 10 - Count number by unique response
[
1,
1,
1,
1,
1,
1,
1,
1,
1,
1
]

Response counter for parallel requests of 8 different temperatures
Temperature = 0 - Count number by unique response
[
4,
1,
1,
1,
2,
1
]
*all other temperatures generated 10 different responses

@jessegross commented on GitHub (Jan 6, 2025):

Yes, there is known non-determinism as a result of prompt caching (which affects the first response) and parallelism. It's because floating point does not give exactly the same results for different combinations of operations that are mathematically equal. The effect is more pronounced on some hardware - for example, Nvidia GPUs show it more than Apple.
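
A tiny illustration (my addition) of the floating-point effect described here: addition is not associative, so mathematically equal sums computed in different orders (as happens with different GPU reduction schedules) can differ, which is enough to flip a near-tie between tokens:

```python
# Floating-point addition is not associative: regrouping the same
# three terms changes the result in the last bit.
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
print(x == y)  # False
print(x, y)    # 0.6000000000000001 0.6
```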

@jtyska commented on GitHub (Jan 6, 2025):

Actually, the parallelism is in the processes that are making the API requests; the Ollama server is answering them sequentially. Each time I run multiple API requests against the same model with different seeds/temperatures (0 and some other value), I get completely random responses for temperature 0. This isn't supposed to happen, right?

About the prompt caching, is it possible to disable it?

@d-kleine commented on GitHub (Jan 7, 2025):

> About the prompt caching, is it possible to disable it?

Not yet, see #5760

I think it would also be good to have these settings mentioned here: https://github.com/ggerganov/llama.cpp/issues/8353
These would make generated outputs truly deterministic.

@d-kleine commented on GitHub (Jan 7, 2025):

@jtyska I just saw this discussion: https://github.com/ggerganov/llama.cpp/discussions/3005#discussioncomment-11151329

Have you tried this?


Edit: @jtyska I just saw that you have asked in the linked thread. Afaik there is no direct `--sampling-seq` option in ollama, but as I understand this param from its [documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md) (see under "Sampling params"), it decides the order in which the samplers are run.

So, have you tried just using `"top_k": 1` to get consistently reproducible outputs?
Please also see [ollama's param doc](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter) for reference (just to show where I got the param from; its description is actually not quite precise: `top_k` considers the k most probable tokens at each generation step, so `top_k=1` will always select the most probable token at each step of the generation process).
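
A sketch (my own) of the greedy-decoding configuration suggested here, with a placeholder model name:

```python
import requests

# top_k=1 restricts sampling to the single most probable token at each
# step, i.e. greedy decoding, independent of seed and temperature.
payload = {
    "model": "qwen2.5:1.5b",  # placeholder model
    "prompt": "Hi! How are you?",
    "stream": False,
    "options": {"temperature": 0, "seed": 42, "top_k": 1},
}

r = requests.post("http://localhost:11434/api/generate", json=payload)
print(r.json()["response"])
```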

@d-kleine commented on GitHub (Jan 14, 2025):

@jtyska Did this resolve your issue or do you need further help with this?

@jtyska commented on GitHub (Jan 14, 2025):

Hey @d-kleine, I'm running multiple experiments with temperature/seed values and different models, but I'm not confident that the behavior is consistent. I'm waiting for my experiments to finish, and then I will analyze their outputs to check whether they are consistent. The workaround I'm trying, to avoid prompt caching and any other parallel interference, is not having the same model answer multiple parallel requests. Also, when a specific experiment with a seed/temperature pair ends, I unload the model (keep_alive=0) and load it again with the new parameters. Let's see.

I will also try to figure out how to use top_k within the API when I have time. I'll let you know how it goes.

@huynhducloi00 commented on GitHub (Feb 14, 2025):

This is a pretty bad issue. It basically means the platform answers differently to the same prompt. Imagine this being used in a yes/no setup: an answer of 'yes' is 180 degrees away from 'no'.

Without this being fixed, I don't see how Ollama can be used.

@d-kleine commented on GitHub (Feb 14, 2025):

> I don't set the `temperature`, `top_p`, or `top_k`. Setting those three to `0` will make the seed redundant

That's not true - the seed still influences the initial probability distribution affecting the model's token selection process, even for a `top_p` (limiting the cumulative probability distribution) near 0 or a low `top_k` (restricting the number of candidate tokens). Also, setting those parameters to 0 (except `temperature`, but even then this will be close to 0) doesn't make sense imho.

@d-kleine commented on GitHub (Feb 14, 2025):

No problem - just to be precise, setting top_p to zero is technically fine when temperature is zero since this overrides the nucleus sampling anyway (adjust either temperature or top_p, but not both simultaneously). But top_k must remain ≥1 to allow token selection.

@huynhducloi00 commented on GitHub (Feb 14, 2025):

The issue still happens with top_k=1, seed=42, temperature=0, and top_p at its default (don't know what that is). It does not make any sense: I ask it the same prompt; the first time it gives the answer 'yes', the second time 'no'. So should I report this answer as yes or no in the paper? lol. This is really of no use.

@d-kleine commented on GitHub (Feb 14, 2025):

@huynhducloi00 Please provide the model (and code, if possible)

@huynhducloi00 commented on GitHub (Feb 14, 2025):

It can be reproduced with the model 'deepseek-r1:14b':

```python
from ollama import generate

MODEL = 'deepseek-r1:14b'

def get_ollama_order_flaky(prompt):
    response = generate(MODEL, prompt, options={'temperature': 0, 'seed': 42, 'top_k': 1})
    return response['response']

prompt = """
download from here https://justpaste.it/gk2xt
"""
```

It sometimes gives the answer yes, sometimes no, especially when you flip in a different prompt in between.
@d-kleine commented on GitHub (Feb 14, 2025):

@huynhducloi00 Please try with `options={"temperature": 0, "seed": 42, "top_k": 1, "top_p": 1}`; at least on my end (GPU) it's consistently reproducible across multiple runs (with r1 1.5b). Please let me know if that suits your requirements.

Edit:
The prompt should be further engineered, e.g. making it more precise that it's a question and what the answer format should look like, e.g.

```text
can `testCompositeKeys` be flaky depending on the order of which it is run compared to other tests? Answer with "Yes" or "No" only, nothing more.*
```

if you want to output only "Yes" or "No". For example, I have added a `?` (indicating a question) and specified the desired output format a little more. This makes the input clearer for the model, which has implications for the token selection process in the output.

@huynhducloi00 commented on GitHub (Feb 14, 2025):

Thanks a lot for the response. That helps a lot

@huynhducloi00 commented on GitHub (Feb 15, 2025):

Sadly, `top_p` does not solve it. Here is the result of 10 questions:

```
Run 1:
[(5, False),
 (6, True),
 (7, False),
 (8, True),
 (9, True),
 (10, True),
 (11, True),
 (12, True),
 (13, False),
 (14, False)]

Run 2:
[(5, False),
 (6, True),
 (7, True),
 (8, True),
 (9, False),
 (10, True),
 (11, True),
 (12, True),
 (13, False),
 (14, True)]
```

They are off (different) starting at index 7.

  • Here is the answer at run 1: https://justpaste.it/dc73s
  • Here is the answer at run 2: https://justpaste.it/3t6y9

The model being tested is deepseek-r1:14b. The options are `"temperature": 0, "seed": 42, "top_k": 1, "top_p": 1, "num_ctx": 10000`.

I am not sure why this is not being prioritized; it is very easy to reproduce. Is it due to randomness in the quantization? I guess deepseek-r1:14b is a quantized model.

@d-kleine commented on GitHub (Feb 16, 2025):

Have you tried without limiting the context length (`num_ctx`)?

There are numerous other reasons that can introduce randomness, e.g.

  • PRNG initialization between runs
  • parallel GPU computation (nothing can be done about this; you can double-check whether the behaviour is the same on CPU only, see the sketch below)
  • the quantization method (Q4_K_M)
  • floating-point arithmetic (fp32, fp16, bfloat16)
  • etc. pp.
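
A sketch (my addition, not from the thread) of the CPU-only cross-check mentioned in the list above: Ollama's documented `num_gpu` option controls how many layers are offloaded to the GPU, so setting it to 0 should force CPU inference for the comparison:

```python
import requests

# Offload zero layers to the GPU to force CPU-only inference, then repeat
# the request and see whether the nondeterminism persists.
payload = {
    "model": "deepseek-r1:14b",
    "prompt": "Answer with Yes or No: is 17 a prime number?",
    "stream": False,
    "options": {"temperature": 0, "seed": 42, "top_k": 1, "num_gpu": 0},
}

outs = set()
for _ in range(5):
    outs.add(requests.post("http://localhost:11434/api/generate", json=payload).json()["response"])

print(len(outs), "distinct response(s) on CPU")
```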

@dan31 commented on GitHub (Mar 26, 2025):

Wow, this is disastrous. We encounter this a lot now. What is the true reason quantized models cannot produce deterministic outputs when all non-determinism is turned off in ollama? For us this is completely serial execution on a GPU by a single client, with a Q4_K_M model, the most popular quantization format.

@kevin-pw commented on GitHub (Mar 26, 2025):

> model being test is deepseek-r1:14b. The option is `"temperature": 0, "seed": 42, "top_k": 1, 'top_p':1,'num_ctx':10000`

I wrote a longer comment in the related (and still open) issue here: https://github.com/ollama/ollama/issues/5321#issuecomment-2755465128, but the main takeaway is: several models also produce inconsistent embeddings for the same inputs, which can significantly affect the quality of downstream applications, for example RAG. That also means sampler options (like `temperature`, `seed`, `top_k`, etc.) cannot fix this issue.
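
A sketch (mine) of how that embedding claim can be checked: request the same input twice from the `/api/embed` endpoint (available in recent Ollama versions) and compare the vectors; the model name is a placeholder:

```python
import math

import requests

URL = "http://localhost:11434/api/embed"
payload = {"model": "nomic-embed-text", "input": "the same sentence twice"}  # placeholder model

a = requests.post(URL, json=payload).json()["embeddings"][0]
b = requests.post(URL, json=payload).json()["embeddings"][0]

# Deterministic embeddings should match exactly (cosine similarity 1.0).
dot = sum(x * y for x, y in zip(a, b))
cos = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
print("identical:", a == b, "| cosine:", cos)
```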

@flexorx commented on GitHub (Mar 26, 2025):

@kevin-pw do you know if this issue was recently introduced?
I couldn't find any mentions of a similar issue until recently.

@huynhducloi00 commented on GitHub (May 9, 2025):

Yes, not sure why the team does not prioritize this.

Reference: github-starred/ollama#46774