[GH-ISSUE #5321] Llama3: Generated outputs inconsistent despite seed and temperature #49843

Open
opened 2026-04-28 13:09:39 -05:00 by GiteaMirror · 22 comments
Owner

Originally created by @d-kleine on GitHub (Jun 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5321

What is the issue?

Follow-up of #586

Even though the output should be deterministic and reproducible with a fixed seed, temperature set to 0, and a fixed num_ctx, the generated output of Llama 3 differs slightly between the first and the second execution of this code (without a kernel restart). All subsequent executions match the second one:

Code snippet taken from LLMs from scratch - Evaluation with Ollama (https://github.com/rasbt/LLMs-from-scratch/blob/1db199995121afc56146f92ec502b68df17e9c0a/ch07/03_model-evaluation/llm-instruction-eval-ollama.ipynb):

import urllib.request
import json


def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "options": {
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048 # must be set, otherwise slightly random output
        }
    }

    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data

result = query_model("What do Llamas eat?")
print(result)

Output of execution no. 1 (output can vary):

Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy munching on hay as a snack or as a main meal.
3. Grains: Llamas may be fed grains like oats, barley, or corn as an occasional treat or to supplement their diet.
4. Fruits and vegetables: Fresh fruits and veggies, such as apples, carrots, and sweet potatoes, can be given as treats or added to their meals for variety.
5. Leaves and shrubs: Llamas will also eat leaves from trees and shrubs, like willow or cedar.

In the wild, llamas might eat:

* Various grasses and plants
* Leaves from trees and shrubs
* Fruits and berries
* Bark (in some cases)

Domesticated llamas, on the other hand, typically receive a diet that includes:

* Hay as their main staple
* Grains or pellets as a supplement
* Fresh fruits and veggies as treats

It's essential to provide llamas with a balanced diet that meets their nutritional needs. Consult with a veterinarian or an experienced llama breeder to determine the best feeding plan for your llama.

Output for execution no. 2 to execution no. n (output should be reproducible):

Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy munching on hay cubes or loose hay.
3. Grains: Llamas may receive grains like oats, barley, or corn as part of their diet. However, these should be given in moderation to avoid digestive issues.
4. Fruits and vegetables: Fresh fruits and veggies can be a tasty treat for llamas. Some favorites include apples, carrots, sweet potatoes, and leafy greens like kale or spinach.
5. Minerals: Llamas need access to mineral supplements, such as salt licks or loose minerals, to ensure they're getting the necessary nutrients.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, like willow or cedar.
2. Bark: In some cases, llamas may eat the bark of certain trees, like aspen or birch.
3. Mosses: Llamas have been known to graze on mosses and other non-woody plant material.

It's essential to provide a balanced diet for your llama, taking into account their age, size, and individual needs. Consult with a veterinarian or experienced llama breeder to determine the best feeding plan for your llama.

Observations:

  • As the outputs show, the first execution produces a somewhat random output, whereas the second and all subsequent executions produce a consistent, deterministic output (a small check for this is sketched after this list).
  • I have tried different platforms (Windows, Docker with an Ubuntu image), and the generated outputs differ across operating systems: the first execution is always somewhat random and the following ones are consistent within a platform, but on Windows, for example, this code produced a different consistent output than on Ubuntu.
  • Setting a Python hash seed did not solve the issue.
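
A minimal way to observe this (a hypothetical helper, not part of the original notebook) is to call the query_model function defined above twice with the identical prompt and diff the two outputs; on an affected setup, the first run after a model load differs from the second:

import difflib

# Reuses query_model from the snippet above; prompt and comparison are illustrative.
first = query_model("What do Llamas eat?")
second = query_model("What do Llamas eat?")

if first == second:
    print("Outputs identical")
else:
    print("Outputs differ:")
    for line in difflib.unified_diff(first.splitlines(), second.splitlines(), lineterm=""):
        print(line)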

OS

Linux, macOS, Windows, Docker, WSL2

GPU

Nvidia

CPU

AMD

Ollama version

0.1.46

GiteaMirror added the bug label 2026-04-28 13:09:39 -05:00
Author
Owner

@mitar commented on GitHub (Jul 17, 2024):

Your version does include ead259d877 (https://github.com/ollama/ollama/commit/ead259d877fc8b20f7943f1f9e8eeaae0acfa52a), so I am not sure why.

Author
Owner

@sayap commented on GitHub (Jul 18, 2024):

Can try to apply this patch:

diff --git a/llm/server.go b/llm/server.go
index 36c0e0b5..b93b5b6c 100644
--- a/llm/server.go
+++ b/llm/server.go
@@ -734,7 +734,7 @@ func (s *llmServer) Completion(ctx context.Context, req CompletionRequest, fn fu
 		"seed":              req.Options.Seed,
 		"stop":              req.Options.Stop,
 		"image_data":        req.Images,
-		"cache_prompt":      true,
+		"cache_prompt":      false,
 	}
 
 	// Make sure the server is ready

The cache_prompt flag was set to true by commit a64570dca. From https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints, it says:

prompt: Provide the prompt for this completion as a string or as an array of strings or numbers representing tokens. Internally, if cache_prompt is true, the prompt is compared to the previous completion and only the "unseen" suffix is evaluated.

Once I have applied this patch, I can get the exact same output when sending the same prompt with the same seed and the same temperature, regardless of kernel restart. For example:

$ curl http://localhost:11434/api/chat -d '{
  "model": "phi3:medium",
  "messages": [
    {
      "role": "user",
      "content": "Tell me a short story about ghost. Limit it to 20 words."
    }
  ],
  "options": {
    "seed": 666,
    "temperature": 0.666
  },
  "stream": false
}'

1st output:

{
  "model": "phi3:medium",
  "created_at": "2024-07-18T00:09:38.40377522Z",
  "message": {
    "role": "assistant",
    "content": " A lonely ghost haunted an old mansion, seeking companionship. One day, a curious child visited; they found friendship within the walls."
  },
  "done_reason": "stop",
  "done": true,
  "total_duration": 7185272872,
  "load_duration": 3884720,
  "prompt_eval_count": 21,
  "prompt_eval_duration": 1255198000,
  "eval_count": 32,
  "eval_duration": 5884770000
}

2nd output:

{
  "model": "phi3:medium",
  "created_at": "2024-07-18T00:09:49.3024363Z",
  "message": {
    "role": "assistant",
    "content": " A lonely ghost haunted an old mansion, seeking companionship. One day, a curious child visited; they found friendship within the walls."
  },
  "done_reason": "stop",
  "done": true,
  "total_duration": 7152606421,
  "load_duration": 4006631,
  "prompt_eval_count": 21,
  "prompt_eval_duration": 1255757000,
  "eval_count": 32,
  "eval_duration": 5892071000
}

I guess this flag should be made configurable?
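
For anyone who wants to experiment without patching Ollama, cache_prompt can also be toggled per request when talking to a llama.cpp server directly. A minimal sketch (assuming a llama-server instance on localhost:8080, its default port; the prompt and sampling values are illustrative):

import json
import urllib.request

# Call llama.cpp's /completion endpoint with prompt caching disabled, so every
# request is evaluated from scratch instead of reusing a cached prefix.
payload = json.dumps({
    "prompt": "Tell me a short story about a ghost. Limit it to 20 words.",
    "seed": 666,
    "temperature": 0.666,
    "n_predict": 64,
    "cache_prompt": False,
}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:8080/completion",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["content"])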

Author
Owner

@d-kleine commented on GitHub (Jul 18, 2024):

To be honest, I don't know if that fixes the issue. For your output, you used a different model and a different prompt, and did not validate it across different operating systems.

The KV cache is actually a helpful feature, but it might be initialized differently across operating systems. Disabling it might therefore mask this issue, but it does not resolve the underlying problem with KV cache initialization.
https://github.com/ggerganov/llama.cpp/issues/4902

But you actually gave me an idea: bypassing the KV cache by setting num_keep=0 (this would not disable it, but at least no tokens would be stored in the cache).

I don't know how to install Ollama with your changes, neither on Ubuntu nor on Windows, so I will test it once a new Ollama version with the change is released. Thanks for the PR anyway!


BTW I have also opened a PR on llama.cpp for making outputs 100% deterministic:
https://github.com/ggerganov/llama.cpp/discussions/8265

When using temperature=0, a small coefficient might be used to prevent zero division. In some cases, this might slightly change the generated output, depending on the model used. Therefore, it would be better to turn off beam search and multinomial sampling for deterministic sampling.

Setting a seed only makes sense for ensuring reproducibility with non-deterministic sampling, such as top-k or top-p sampling. The reproducible-outputs example in the Ollama docs (https://github.com/ollama/ollama/blob/main/docs/api.md#chat-request-reproducible-outputs) doesn't fully make sense, because with temperature=0 the generated output would be deterministic anyway and no seed would be needed. But when setting a temperature > 0.0, you also need to set a seed to make the output reproducible.
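
To illustrate the point with a toy example (not Ollama code): greedy decoding picks the argmax of the logits, so given identical logits the seed cannot change the result; the seed only matters once tokens are actually sampled from the softmax distribution.

import math
import random

logits = {"grass": 2.1, "hay": 1.9, "rocks": -3.0}

def greedy(logits):
    # temperature -> 0: always pick the highest-scoring token, no randomness involved
    return max(logits, key=logits.get)

def sample(logits, temperature, seed):
    # temperature > 0: sample from the softmax distribution, reproducible only with a fixed seed
    rng = random.Random(seed)
    weights = {t: math.exp(v / temperature) for t, v in logits.items()}
    r, acc = rng.random() * sum(weights.values()), 0.0
    for token, w in weights.items():
        acc += w
        if r <= acc:
            return token
    return token

print(greedy(logits))                             # 'grass', regardless of any seed
print(sample(logits, temperature=0.8, seed=123))  # reproducible only because the seed is fixed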

Author
Owner

@psambit9791 commented on GitHub (Jan 2, 2025):

Is there any update on this issue?
I have also been encountering this issue.

Author
Owner

@sisp commented on GitHub (Feb 26, 2025):

Same here with Microsoft's Phi-3 Mini model.

Author
Owner

@lemassykoi commented on GitHub (Mar 16, 2025):

Same here with Ollama 0.6.1

I wrote a script to test this across models: https://gist.github.com/lemassykoi/e1423068d1d976961953d86609877fd5

Seed and temperature are fixed values.
For each model in the list, the script restarts the Ollama service, then sends the same query twice to the model.
If the model output is strictly identical, the test passes.
If the model output differs from pass 1 to pass 2, the test fails.

Image: https://github.com/user-attachments/assets/e328860b-5845-4183-8354-4ed0784120e5

Author
Owner

@kevin-pw commented on GitHub (Mar 26, 2025):

I can confirm inconsistent outputs with Ollama v0.6.3-rc0 for several models I tested:
llama3.2:latest
llama3.2-vision:11b
gemma3:12b

I noticed that llama3.2:latest produces inconsistent results for identical inputs not only on the /generate endpoint but also on the /embed endpoint. That means the model produces different probability distributions for the same inputs. That also means that adjusting sampler options (like temperature, seed, top_p, top_k, etc.) cannot fix this problem.

The inconsistent outputs present a serious issue because they degrade the quality of any downstream applications. In my tests, the cosine similarity between the different embeddings for the same inputs was as low as 99.4%, but I suspect that similarity could drop even lower. Especially for RAG applications where similarities for large datasets often fall within a small range, this inconsistency appears beyond acceptable.

I am not including any code here because the inconsistencies are somewhat difficult to reproduce. I have found the inconsistencies to occur on the same machine when I run Ollama on CPU vs GPU, and when running Ollama using the installed version vs a docker container vs compiling the development version from source. Sometimes these changes cause the inconsistencies, and sometimes they do not.

@rick-github Could you take another look at this issue? I believe the inconsistencies are serious enough to warrant attention, but I know you have a long list of priorities. Shout-out to the Ollama team for your amazing work!

Author
Owner

@rick-github commented on GitHub (Mar 26, 2025):

A quick first pass with generating embeddings with llama3.2:latest failed to show any inconsistencies. Can you give me an idea of the type of input, chunk length and context length you are using?

Author
Owner

@flexorx commented on GitHub (Mar 26, 2025):

@rick-github please consider running this, for instance:
https://github.com/ollama/ollama/issues/5321#issuecomment-2727539874
https://gist.github.com/lemassykoi/e1423068d1d976961953d86609877fd5
In our case, the issue is present with pretty much any prompt, with temperature, top_k, top_p, and seed all zeroed out and repetition_penalty set to 1.0 (all non-determinism off), on various quantizations of Mistral 24B, Mistral 24B-3.1, and Gemma 3 of various sizes, with various context window sizes (2k, 4k, 8k), on an Nvidia GPU, both with and without parallelism enabled for Ollama (always on the same OS and the same computer).

Author
Owner

@lemassykoi commented on GitHub (Mar 27, 2025):

New version with embed testing: https://gist.github.com/lemassykoi/5a6c0d655b5923e9588eef68d12fcbd2

It tests all models available in your local Ollama instance, except those with certain words in the model name, such as code or embed (which can't chat or generate).

Some models are not capable of embedding; they will appear as invalid models without raising an exception.

Image: https://github.com/user-attachments/assets/f4c19e94-aad3-4134-bb4e-4637840c3041

ollama 0.6.2

Image: https://github.com/user-attachments/assets/f97feb5b-8765-4ac8-aee1-39c3d535013d

Author
Owner

@kevin-pw commented on GitHub (Mar 27, 2025):

@rick-github Thank you for looking into this issue!

I am able to reproduce the inconsistent embedding results by running Ollama compiled from source and setting CUDA_VISIBLE_DEVICES either to 0 or to -1.

First, I compile from source as usual. I am using v0.6.3-rc0, which is commit e5d84fb:
cmake -B build
cmake --build build

I then explicitly use the GPU by setting:
export CUDA_VISIBLE_DEVICES=0

Then, run Ollama:
go run . serve

In a separate terminal, I issue a curl command to Ollama:

curl -X POST "http://localhost:11434/api/embed" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "llama3.2:latest",
           "input": "What is the weather like today? Note that we are located on the moon.",
           "raw": false
         }'

The response is:
{"model":"llama3.2:latest","embeddings":[[0.00094434456,0.013325062,-0.026114173,...,-0.03395605,0.0069057234,-0.010010444]],"total_duration":1517785083,"load_duration":1414543924,"prompt_eval_count":16}

I then stop Ollama with CTRL + C, and explicitly use the CPU by setting:
export CUDA_VISIBLE_DEVICES=-1

Then, run Ollama:
go run . serve

Using the same curl command as above, the response is:
{"model":"llama3.2:latest","embeddings":[[0.0010927601,0.014933009,-0.024558352,...,-0.03514633,0.0063932026,-0.009075297]],"total_duration":225640647,"load_duration":1596095,"prompt_eval_count":16}

As you can see, using the GPU vs CPU to compute identical curl requests results in different embeddings. The cosine similarity between the different embeddings is 99.86% in this example but I have seen lower similarities. Ideally, the similarity should be 100%. I haven’t changed any parameters other than using the GPU vs CPU. That means my curl requests use the default context length, chunk size, etc.
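
For reference, a minimal sketch of how such a comparison can be computed (the helper below is hypothetical; the endpoint, model, and input match the curl example above). Run the embed call once per server configuration, persist the vectors between restarts, and then compare them:

import json
import math
import urllib.request

def embed(text, url="http://localhost:11434/api/embed", model="llama3.2:latest"):
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    request = urllib.request.Request(url, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["embeddings"][0]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

text = "What is the weather like today? Note that we are located on the moon."
v1 = embed(text)  # e.g. against the GPU-backed server
v2 = embed(text)  # e.g. against the CPU-backed server after a restart, or a vector saved to disk
print(f"cosine similarity: {cosine(v1, v2):.6f}")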

@lemassykoi I tried your script but was unable to reproduce the inconsistent embedding results. Your script restarts Ollama without changing any other settings (like switching from GPU to CPU), is that right? A restart alone did not produce inconsistencies for me. Perhaps this depends on the machine Ollama is running on. I use Ubuntu Linux 24.10.

Author
Owner

@lemassykoi commented on GitHub (Mar 27, 2025):

@lemassykoi I tried your script but was unable to reproduce the inconsistent embedding results. Your script restarts Ollama without changing any other settings (like switching from GPU to CPU), is that right? A restart alone did not produce inconsistencies for me. Perhaps this depends on the machine Ollama is running on. I use Ubuntu Linux 24.10.

Yes, I don't switch between CPU and GPU

I'm using Debian 12

Author
Owner

@sisp commented on GitHub (Mar 27, 2025):

@kevin-pw Consistency across CPU and GPU cannot be guaranteed: https://pytorch.org/docs/stable/notes/randomness.html

Author
Owner

@kevin-pw commented on GitHub (Mar 28, 2025):

Summary

It looks like three different issues might cause different embeddings, logits, and text generation for the same inputs:

  1. Generating the output on different operating systems
  2. Generating the outputs on CPU vs GPU
  3. Generating the outputs using the KV cache

After some digging, issues 1) and 2) do not appear to have an easy fix. These issues could be problematic for downstream applications like RAG, but it may be possible to mitigate those issues by carefully working around them. Issue 3) can be addressed by avoiding the use of the KV cache.

@rick-github I have two suggestions:

  • Perhaps it would be useful to highlight in the Ollama docs that Ollama outputs are non-deterministic. I think this could avoid future frustration for developers :)
  • Perhaps using or not using the KV cache should be a parameter option.

Apologies for the long rant – this was a deep dive.

1) Generating the output on different operating systems

Observing different results on different operating systems is consistent with the following issues posted on the llama.cpp repo:
https://github.com/ggml-org/llama.cpp/issues/2582
https://github.com/ggml-org/llama.cpp/discussions/2100#discussioncomment-6353790

Unfortunately, the two issues above were never fully resolved. The second issue suggests building a portable binary by statically linking the CUDA libraries, but I haven’t tested if that approach actually achieves consistent results between operating systems.

In my tests, I ran Ollama compiled from source directly on Ubuntu Linux 24.10, and I ran Ollama within a docker container that uses Ubuntu Linux 20.04. While running both of those Ollama instances on the same machine, gemma3:12b produced different results for the same text + image input.

To reproduce this issue:

Compile Ollama v0.6.3-rc0 from source as described in the docs (https://github.com/ollama/ollama/blob/main/docs/development.md):
Run Ollama with go run . serve
Run the python code below **
On my Ubuntu Linux 24.10 machine, the response is: {"most_likely_text": "nbg2m", "less_likely_text": "nbg2m"} This response contains the correct letters and numbers shown in the image.

To receive a different response, stop Ollama with CTRL + C and:
Build the docker image with docker build -t ollama . as described in the docs (https://github.com/ollama/ollama/blob/main/docs/docker.md).
Run Ollama in docker with docker run --gpus=all -v ollama:/root/.ollama -p 127.0.0.1:11434:11434 --name ollama ollama
Run the python code below **
On my machine, the response is: {"most_likely_text": "nby2m", "less_likely_text": "nbyzm"} This response DOES NOT contain the correct letters and numbers shown in the image.

2) Generating the outputs on CPU vs GPU

As described in the PyTorch docs, consistency across CPU and GPU cannot be guaranteed. In fact, a large number of CUDA algorithms are non-deterministic. Some algorithms have deterministic but slower equivalents, but several other algorithms cannot behave deterministically at all (Thank you, @sisp !). So it may not ever be possible to produce consistent outputs with Ollama across different hardware.
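
As an illustration of that guidance (PyTorch code only, purely illustrative; Ollama/llama.cpp does not use PyTorch, but similar constraints apply to its CUDA kernels), opting into deterministic behavior looks roughly like this, and even then it only promises run-to-run reproducibility on the same device:

import os
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic cuBLAS matmuls
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)  # errors out if an op has no deterministic kernel

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 8, device=device)
w = torch.randn(8, 8, device=device)
print((x @ w).sum().item())  # stable across runs on the same device, not across CPU vs GPU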

To reproduce this issue:

Run Ollama within the docker image built in section 1) and using the additional flags -e CUDA_VISIBLE_DEVICES=0 (for GPU) or -e CUDA_VISIBLE_DEVICES=-1 (for CPU).
Run the python code below **
On GPU, the response was: {"most_likely_text": "nby2m", "less_likely_text": "nbyzm"}
On CPU, the response was: {"most_likely_text": "nby2m", "less_likely_text": "nbytm"} (with t instead of z, but both responses are incorrect.)

3) Generating the outputs using the KV cache

The KV cache temporarily stores part of a prompt and its response so that identical prompts do not have to be regenerated by the model when submitting the same prompt twice. However, using the stored results can cause the /generate endpoint to produce different results when processing the same inputs twice in quick succession.

The /embed endpoint is unaffected by this issue because use of the KV cache is explicitly set to false:
https://github.com/ollama/ollama/blob/01aa7887221e7bd286ebcb14a088c94ba1c22a99/runner/llamarunner/runner.go#L703

Relevant known issues related to KV cache:
https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535
https://github.com/ggml-org/llama.cpp/issues/3014

To reproduce this issue:

Run Ollama within the docker image built in section 1) using the additional flag -e CUDA_VISIBLE_DEVICES=-1 (for CPU), and then run the python code below ** twice.
First response: {"most_likely_text": "nby2m", "less_likely_text": "nbytm"}
Second response: {"most_likely_text": "n2m", "less_likely_text": "bg2m"}

As you can see, the first and second response are completely different. Both responses are incorrect when compared to the letters and numbers contained in the image.

Why different results for identical inputs present a significant problem

Downstream applications like retrieval augmented generation (RAG), classification, annotation, or semantic search depend on consistent embedding, logit, and text generation. Those applications usually compare similarity between embeddings, so any uncertainty in embeddings reduces the quality of results that those applications can produce. For large datasets, similarities between embeddings often fall within a relatively small range, so even small differences in embeddings can make those applications unusable.

Possible workarounds to generate consistent outputs for the same inputs (to be confirmed):

  • For 1) use a docker image to make Ollama portable between operating systems. I haven’t tested this approach. If someone does test this, please let us know.
  • For 2) make sure you produce results on the same hardware. For example, if you generate an embeddings database for a RAG application, make sure that embeddings for subsequent searches are produced on the same hardware.
  • For 3) potentially disable the KV cache on the /generate endpoint for the llama runner:
    https://github.com/ollama/ollama/blob/01aa7887221e7bd286ebcb14a088c94ba1c22a99/runner/llamarunner/runner.go#L611
    or for the ollama runner:
    https://github.com/ollama/ollama/blob/01aa7887221e7bd286ebcb14a088c94ba1c22a99/runner/ollamarunner/runner.go#L600
** Click to reveal code used to investigate all three causes of different results for the same inputs

image “lettersandnumbers.jpg”:

Image: https://github.com/user-attachments/assets/2d566ee4-ce53-47ed-81bc-1b2a948a4d8c

import requests
import base64

# Get image as base64
def get_base64_image(image_filename):
    with open(image_filename, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
    return encoded_string

# Define the API endpoint
url = "http://localhost:11434/api/generate"

# Define the payload
payload = {
    "model": "gemma3:12b",
    "prompt": "Respond exclusivly with the exact letters and numbers shown in this image. Provide your most likely guess, and a less likely guess. Respond using JSON.",
    "options": {"temperature": 0.0,
                "seed": 0,
                "top_k": 1,
                },
    "format": {
        "type": "object",
        "properties": {
            "most_likely_text": {"type": "string"},
            "less_likely_text": {"type": "string"},
        },
        "required": [
            "most_likely_text",
            "less_likely_text"
            ]
    },
    "stream": False,
    "images": [get_base64_image(image_filename="lettersandnumbers.jpg")],
}

# Send the POST request
try:
    response = requests.post(url, json=payload)
    
    # Check the response
    if response.status_code == 200:
        print("Raw response text:", response.text)
        # Parse the JSON response
        data = response.json()
        # Extract and print the response field
        resp = data.get("response", "Response not found in response")
        print("Response:", resp)
    else:
        print(f"Failed with status code: {response.status_code}")
        print(response.text)
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
Author
Owner

@flexorx commented on GitHub (Mar 30, 2025):

I can't get it, @kevin-pw. In our case we are running this on the same OS (Red Hat), on the same computer, in the same environment, without any Docker, without parallelism, and purely on GPU. We do NOT do anything explicit or special regarding the KV cache; we just set temperature, top_p, top_k, and the rest of those options to 0 and repetition_penalty to 1.0 to remove ANY non-determinism and ensure fully reproducible results. What is the cause in this case? The whole discussion has a sentiment of "there's nothing we can do", but this obviously was not an issue previously under our conditions and only appeared sometime in 2024 as a kind of bug.

Author
Owner

@kevin-pw commented on GitHub (Mar 30, 2025):

We do NOT do anything explicit or special rgd KV cache

@flexorx Does the following correctly describe your issue when you submit multiple identical input queries?

  1. The text generated by the first input query is different from the text generated by the second input query.
  2. The text generated by the second and all following input queries is identical.

If that is the case, the issue is caused by the KV cache. In the current version 0.6.3 of Ollama, no input parameter is available to disable the KV cache on the /generate endpoint. To eliminate the issue, you would need to implement workaround 3) in my comment above by modifying the two lines in the source code.

Author
Owner

@flexorx commented on GitHub (Mar 30, 2025):

@kevin-pw yes, indeed, the result of the first query is often different from the second and subsequent queries for the same input. Moreover, if we submit some other input before the first occurrence of a specific query, the first result of that query then differs from what its first result would have been without that preceding input.

So, essentially, we can say two things:

  1. The result of the first submission of a particular query always differs from the second and subsequent submissions of that same query, provided that all inputs preceding that first submission remain the same and that no other queries are interleaved between the submissions of this particular query.

  2. Generally, if we perform a sequence of queries A, B, C, D, the results vary for any permutation of that order, such as B, C, A, D or D, A, C, B. In other words, the results are not invariant to the order of queries; they are path-dependent.

Author
Owner

@JakeBeaver commented on GitHub (Mar 31, 2025):

I found some simple repro steps for the /api/chat endpoint

  1. Repeat sending a POST request with this body:
{
    "model": "gemma3:1b",
    "messages": [
        { "role": "user", "content": "What is your name?" }
    ],
    "stream": false,
    "options": { "temperature": 0, "top_p": 1, "top_k": 1, "seed": 123 }
}
  2. First response after model load:
{
    "model": "gemma3:1b",
    "created_at": "2025-03-31T15:17:52.130833Z",
    "message": {
        "role": "assistant",
        "content": "Hello! I’m Gemma, created by the Gemma team at Google DeepMind."
    },
    "done_reason": "stop",
    "done": true,
    "total_duration": 2484392600,
    "load_duration": 1680573700,
    "prompt_eval_count": 14,
    "prompt_eval_duration": 609479800,
    "eval_count": 18,
    "eval_duration": 193867500
}
  3. Every subsequent response:
{
    "model": "gemma3:1b",
    "created_at": "2025-03-31T15:18:22.5370125Z",
    "message": {
        "role": "assistant",
        "content": "I’m Gemma, created by the Gemma team at Google DeepMind."
    },
    "done_reason": "stop",
    "done": true,
    "total_duration": 509578700,
    "load_duration": 133404300,
    "prompt_eval_count": 14,
    "prompt_eval_duration": 161069100,
    "eval_count": 16,
    "eval_duration": 214586700
}
  4. Unload the model by sending this in a POST request body:
{
    "model": "gemma3:1b",
    "messages": [],
    "stream": false,
    "keep_alive": 0
}
  5. Repeat from step 1

Repeating this gives the same result in a loop. After the first response, message.content loses the Hello! and eval_count changes from 18 to 16 for all subsequent responses.

At least it's enough to unload with an API request, so I don't have to rely on shell scripts restarting all of Ollama, but it still seems wasteful to set keep_alive to 0 and force a model unload after every request.
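
Based on those steps, a small helper (hypothetical, not an official workaround; endpoint and options copied from the repro above) can force a cold cache for every request by unloading the model right after each call:

import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"

def post(body):
    request = urllib.request.Request(CHAT_URL, data=json.dumps(body).encode("utf-8"),
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

def ask(prompt, model="gemma3:1b"):
    reply = post({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": 0, "top_p": 1, "top_k": 1, "seed": 123},
    })
    # Unload the model (keep_alive: 0) so the next request starts from a cold cache.
    post({"model": model, "messages": [], "stream": False, "keep_alive": 0})
    return reply["message"]["content"]

print(ask("What is your name?"))
print(ask("What is your name?"))  # should now match the first answer exactly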

Author
Owner

@wyli commented on GitHub (Mar 31, 2025):

Disabling the KV cache as mentioned by @kevin-pw works for me. I put a possible implementation here to make it configurable: https://github.com/ollama/ollama/pull/10064

Author
Owner

@d-kleine commented on GitHub (Apr 4, 2025):

@sayap already identified the KV cache as the root of the problem with generating consistent reproducible outputs almost a year ago. He also submitted a fix which worked at least when I tested it back then (#5760), but this PR has not been merged.

I have also tried to make the PRNG for the cache initialization consistent with a seed, but never got that really working. The downside of disabling the KV cache is that it forces the LLM to fully recompute the attention over the whole prompt, which costs more compute than having the KV cache enabled.

Author
Owner

@wyli commented on GitHub (Apr 4, 2025):

not sure why that PR was not considered as well... in general these don't change the default and increase flexibility.

(I think the analysis here has already demonstrated the numerical differences: https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535)

Author
Owner

@Jonas-Wessner commented on GitHub (Feb 25, 2026):

I confirm the bug on ubuntu 24.04.3, running on L40s GPUs.
When I execute the script for the first time, I get a random output for iteration 0 of the loop. For subsequent iterations, I get a different, but consistent output.
If I run the script again, the bug is gone.
If I change something about the prompt (causing some cache reload I suppose), the bug can be reproduced again.
I hope this can be fixed soon, since otherwise it is hard to guarantee reproducible experiment results.

from openai import OpenAI
import hashlib

# Initialize client with custom base URL
client = OpenAI(
    base_url="http://localhost:11435/v1",  # your Ollama/OpenAI-compatible server
    api_key="ollama"                        # non-empty placeholder
)

prompt = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a random list of numbers please. But really really random please."}
]

outputs = []

for i in range(3):
    response = client.chat.completions.create(
        model="llama3.3:70b",
        messages=prompt,
        seed=42,
        temperature=0,
    )
    text = response.choices[0].message.content
    outputs.append(text)
    print(f"Run {i+1}: {text}")

# Check if all outputs are identical
hashes = [hashlib.sha256(o.encode()).hexdigest() for o in outputs]
all_same = len(set(hashes)) == 1
print("\nAll outputs identical:", all_same)
Reference: github-starred/ollama#49843