Mirror of https://github.com/ollama/ollama.git (synced 2026-05-07 00:22:43 -05:00)
[GH-ISSUE #5321] Llama3: Generated outputs inconsistent despite seed and temperature #65369
Open · opened 2026-05-03 20:58:26 -05:00 by GiteaMirror · 22 comments
Originally created by @d-kleine on GitHub (Jun 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5321
What is the issue?
Follow-up of #586
Even though the output is supposed to be deterministic and reproducible with a fixed seed, a temperature set to 0, and a fixed num_ctx, the generated output of Llama 3 differs slightly between the first execution of this code and the second execution (without a kernel restart). All following executions match the second execution. Code snippet taken from LLMs from scratch - Evaluation with Ollama:
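The original snippet is not reproduced in this mirror. Purely as a hedged sketch of the kind of reproducibility check being described (model name, prompt, and parameter values are placeholders; the requests package is assumed to be installed):

```python
import requests  # third-party HTTP client, assumed to be installed

URL = "http://localhost:11434/api/generate"

# Placeholder model, prompt and parameter values; not the original snippet.
payload = {
    "model": "llama3",
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {"seed": 123, "temperature": 0, "num_ctx": 2048},
}

# With a fixed seed, temperature 0 and a fixed num_ctx, every call is expected
# to return the same text; the report above is that the first call can differ.
outputs = [requests.post(URL, json=payload).json()["response"] for _ in range(3)]
print(outputs[0] == outputs[1], outputs[1] == outputs[2])
```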
Output of execution no. 1 (output can vary):
Output of execution no. 2 to execution no. n (output should be reproducible):
Observations:
OS: Linux, macOS, Windows, Docker, WSL2
GPU: Nvidia
CPU: AMD
Ollama version: 0.1.46
@mitar commented on GitHub (Jul 17, 2024):
Your version does include ead259d877, so I am not sure why.
@sayap commented on GitHub (Jul 18, 2024):
Can try to apply this patch:
The cache_prompt flag was set to true by commit a64570dca. From https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints, it says:
Once I have applied this patch, I can get the exact same output when sending the same prompt with the same seed and the same temperature, regardless of kernel restart. For example:
1st output:
2nd output:
I guess this flag should be made configurable?
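For context, cache_prompt is a per-request field of the llama.cpp server's /completion endpoint, so when talking to a llama.cpp server directly (not through Ollama) it can be turned off without patching anything. A minimal sketch, assuming a llama.cpp server listening on port 8080 and the requests package:

```python
import requests  # assumed to be installed

# Sketch against a llama.cpp server (not the Ollama API): cache_prompt can be
# set per request on /completion to avoid reusing the cached prompt prefix.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "temperature": 0,
        "seed": 123,
        "cache_prompt": False,
    },
)
print(resp.json()["content"])
```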
@d-kleine commented on GitHub (Jul 18, 2024):
To be honest, I don't know if that fixes the issue. For the output, you have used a different model and a different prompt, and not validated it across different operating systems.
The KV cache is actually a helpful feature, but it might be initialized differently across operating systems. Therefore, disabling it might fix this issue, but it does not resolve the underlying issue with KV cache initialization.
https://github.com/ggerganov/llama.cpp/issues/4902
But you actually gave me an idea to bypass the KV cache just by setting num_keep=0 (this would not disable it, but at least no tokens would be stored in the cache then). I don't know how to install Ollama with your changes, neither on Ubuntu nor on Windows; I will test it once a new Ollama version with it implemented is released. Thanks for the PR anyway!
BTW I have also opened a PR on llama.cpp for making outputs 100% deterministic:
https://github.com/ggerganov/llama.cpp/discussions/8265
When using temperature=0, a small coefficient might be used to prevent division by zero. In some cases, this might slightly change the generated output, depending on the model used. Therefore, it would be better to turn off beam search and multinomial sampling for deterministic sampling.
Setting a seed only makes sense when using non-deterministic sampling, such as top-k or top-p sampling, to ensure reproducibility. The example code here for Ollama doesn't fully make sense because you would not need to set a seed, as with temperature=0 the generated output would be deterministic anyway. But when setting a temperature > 0.0, you would need to set a seed as well to make the output reproducible.
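To make the last point concrete, here is a hedged sketch with a placeholder model and prompt: with a temperature above zero the sampling is stochastic, and the seed is what makes repeated requests reproducible.

```python
import requests  # assumed to be installed

URL = "http://localhost:11434/api/generate"

def generate(seed: int) -> str:
    # temperature > 0 enables stochastic sampling; the seed pins it down.
    body = {
        "model": "llama3",
        "prompt": "Name three colors.",
        "stream": False,
        "options": {"temperature": 0.7, "seed": seed},
    }
    return requests.post(URL, json=body).json()["response"]

print(generate(42) == generate(42))  # same seed: expected True (modulo the caching issue in this thread)
print(generate(42) == generate(43))  # different seed: typically False
```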
@psambit9791 commented on GitHub (Jan 2, 2025):
Is there any update on this issue?
I have also been encountering this issue.
@sisp commented on GitHub (Feb 26, 2025):
Same here with Microsoft's Phi-3 Mini model.
@lemassykoi commented on GitHub (Mar 16, 2025):
Same here with Ollama 0.6.1
I wrote a script to test models: https://gist.github.com/lemassykoi/e1423068d1d976961953d86609877fd5
Seed and temperature are fixed values.
For each model in the list, the script restarts the Ollama service, then sends the same query twice to the model.
If the model output is strictly identical, the test passes.
If the model output differs from pass 1 to pass 2, the test fails.
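The gist itself is not reproduced here; its pass/fail logic amounts to roughly the following sketch (the service restart is omitted, and the model list and prompt are placeholders):

```python
import requests  # assumed to be installed

URL = "http://localhost:11434/api/generate"
MODELS = ["llama3.2", "mistral"]            # placeholder model list
OPTIONS = {"seed": 0, "temperature": 0}     # fixed values, as in the gist

def ask(model: str) -> str:
    body = {"model": model, "prompt": "Describe water in one sentence.",
            "stream": False, "options": OPTIONS}
    return requests.post(URL, json=body).json()["response"]

for model in MODELS:
    # The gist restarts the Ollama service here before testing each model.
    first, second = ask(model), ask(model)
    print(model, "OK" if first == second else "FAILED")
```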
@kevin-pw commented on GitHub (Mar 26, 2025):
I can confirm inconsistent outputs with Ollama v0.6.3-rc0 for several models I tested: llama3.2:latest, llama3.2-vision:11b, gemma3:12b.
I noticed that llama3.2:latest produces inconsistent results for identical inputs not only on the /generate endpoint but also on the /embed endpoint. That means the model produces different probability distributions for the same inputs. That also means that adjusting sampler options (like temperature, seed, top_p, top_k, etc.) cannot fix this problem.
The inconsistent outputs present a serious issue because they degrade the quality of any downstream application. In my tests, the cosine similarity between the different embeddings for the same inputs was as low as 99.4%, but I suspect that similarity could drop even lower. Especially for RAG applications, where similarities for large datasets often fall within a small range, this inconsistency is beyond acceptable.
I am not including any code here because the inconsistencies are somewhat difficult to reproduce. I have found the inconsistencies to occur on the same machine when I run Ollama on CPU vs GPU, and when running Ollama using the installed version vs a docker container vs compiling the development version from source. Sometimes these changes cause the inconsistencies, and sometimes they do not.
@rick-github Could you take another look at this issue? I believe the inconsistencies are serious enough to warrant attention, but I know you have a long list of priorities. Shout-out to the Ollama team for your amazing work!
@rick-github commented on GitHub (Mar 26, 2025):
A quick first pass at generating embeddings with llama3.2:latest failed to show any inconsistencies. Can you give me an idea of the type of input, chunk length, and context length you are using?
@flexorx commented on GitHub (Mar 26, 2025):
@rick-github please consider running this, for instance:
https://github.com/ollama/ollama/issues/5321#issuecomment-2727539874
https://gist.github.com/lemassykoi/e1423068d1d976961953d86609877fd5
In our case, the issue is present with pretty much any prompt, with temperature, top_k, top_p, and seed all zeroed out and repetition_penalty set to 1.0 (all non-determinism OFF), on various quantizations of Mistral 24B, Mistral 24B-3.1, and Gemma 3 of various sizes, with various context window sizes like 2k, 4k, 8k, on an Nvidia GPU, both with and without parallelism set for Ollama (always on the same OS and the same computer).
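For reference, the kind of request being described, with every sampling knob pinned down, looks roughly like this against the Ollama API. This is a sketch with placeholder model, prompt, and context size, not the poster's exact configuration; note that Ollama's option is named repeat_penalty.

```python
import requests  # assumed to be installed

body = {
    "model": "mistral-small:24b",            # placeholder model
    "prompt": "Summarize the water cycle.",  # placeholder prompt
    "stream": False,
    "options": {
        "temperature": 0,       # greedy decoding
        "top_k": 0,
        "top_p": 0,
        "seed": 0,
        "repeat_penalty": 1.0,  # repetition penalty disabled
        "num_ctx": 4096,        # placeholder context window
    },
}
resp = requests.post("http://localhost:11434/api/generate", json=body)
print(resp.json()["response"])
```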
@lemassykoi commented on GitHub (Mar 27, 2025):
New version with embed testing: https://gist.github.com/lemassykoi/5a6c0d655b5923e9588eef68d12fcbd2
It will test all the models available in your local Ollama, except some with special words in the model name like code or embed (which can't chat or generate).
Some models are not capable of embedding; they will appear as invalid models without raising an exception.
Ollama 0.6.2
@kevin-pw commented on GitHub (Mar 27, 2025):
@rick-github Thank you for looking into this issue!
I am able to reproduce the inconsistent embedding results by running Ollama compiled from source and setting CUDA_VISIBLE_DEVICES either to 0 or to -1.
First, I compile from source as usual. I am using v0.6.3-rc0, which is commit e5d84fb:
cmake -B build
cmake --build build
I then explicitly use the GPU by setting:
export CUDA_VISIBLE_DEVICES=0
Then, run Ollama:
go run . serve
In a separate terminal, I issue a curl command to Ollama. The response is:
{"model":"llama3.2:latest","embeddings":[[0.00094434456,0.013325062,-0.026114173,...,-0.03395605,0.0069057234,-0.010010444]],"total_duration":1517785083,"load_duration":1414543924,"prompt_eval_count":16}
I then stop Ollama with CTRL + C and explicitly use the CPU by setting:
export CUDA_VISIBLE_DEVICES=-1
Then, run Ollama:
go run . serve
Using the same curl command as above, the response is:
{"model":"llama3.2:latest","embeddings":[[0.0010927601,0.014933009,-0.024558352,...,-0.03514633,0.0063932026,-0.009075297]],"total_duration":225640647,"load_duration":1596095,"prompt_eval_count":16}
As you can see, using the GPU vs. the CPU to compute identical curl requests results in different embeddings. The cosine similarity between the two embeddings is 99.86% in this example, but I have seen lower similarities. Ideally, the similarity should be 100%. I haven't changed any parameters other than using the GPU vs. the CPU; my curl requests use the default context length, chunk size, etc.
@lemassykoi I tried your script but was unable to reproduce the inconsistent embedding results. Your script restarts Ollama without changing any other settings (like switching from GPU to CPU), is that right? A restart alone did not produce inconsistencies for me. Perhaps this depends on the machine Ollama is running on. I use Ubuntu Linux 24.10.
@lemassykoi commented on GitHub (Mar 27, 2025):
Yes, I don't switch between CPU and GPU
I'm using Debian 12
@sisp commented on GitHub (Mar 27, 2025):
@kevin-pw Consistency across CPU and GPU cannot be guaranteed: https://pytorch.org/docs/stable/notes/randomness.html
@kevin-pw commented on GitHub (Mar 28, 2025):
Summary
It looks like three different issues might cause different embeddings, logits, and text generation for the same inputs: 1) generating the output on different operating systems, 2) generating the outputs on CPU vs GPU, and 3) generating the outputs using the KV cache.
After some digging, issues 1) and 2) do not appear to have an easy fix. These issues could be problematic for downstream applications like RAG, but it may be possible to mitigate those issues by carefully working around them. Issue 3) can be addressed by avoiding the use of the KV cache.
@rick-github I have two suggestions:
Apologies for the long rant – this was a deep dive.
1) Generating the output on different operating systems
Observing different results on different operating systems is consistent with the following issues posted on the llama.cpp repo:
https://github.com/ggml-org/llama.cpp/issues/2582
https://github.com/ggml-org/llama.cpp/discussions/2100#discussioncomment-6353790
Unfortunately, the two issues above were never fully resolved. The second issue suggests building a portable binary by statically linking the CUDA libraries, but I haven’t tested if that approach actually achieves consistent results between operating systems.
In my tests, I ran Ollama compiled from source directly on Ubuntu Linux 24.10, and I ran Ollama within a docker container that uses Ubuntu Linux 20.04. While running both of those Ollama instances on the same machine,
gemma3:12b produced different results for the same text + image input.
To reproduce this issue:
Compile Ollama v0.6.3-rc0 from source as described in the docs.
Run Ollama with go run . serve
Run the Python code below **
On my Ubuntu Linux 24.10 machine, the response is:
{"most_likely_text": "nbg2m", "less_likely_text": "nbg2m"}This response contains the correct letters and numbers shown in the image.To receive a different response, stop Ollama with CTRL + C and:
Build the docker image with docker build -t ollama . as described in the docs.
Run Ollama in docker with docker run --gpus=all -v ollama:/root/.ollama -p 127.0.0.1:11434:11434 --name ollama ollama
Run the Python code below **
On my machine, the response is:
{"most_likely_text": "nby2m", "less_likely_text": "nbyzm"}This response DOES NOT contain the correct letters and numbers shown in the image.2) Generating the outputs on CPU vs GPU
As described in the PyTorch docs, consistency across CPU and GPU cannot be guaranteed. In fact, a large number of CUDA algorithms are non-deterministic. Some algorithms have deterministic but slower equivalents, but several other algorithms cannot behave deterministically at all (Thank you, @sisp !). So it may not ever be possible to produce consistent outputs with Ollama across different hardware.
To reproduce this issue:
Run Ollama within the docker image built in section 1) and using the additional flags
-e CUDA_VISIBLE_DEVICES=0 (for GPU) or -e CUDA_VISIBLE_DEVICES=-1 (for CPU).
Run the Python code below **
On GPU, the response was:
{"most_likely_text": "nby2m", "less_likely_text": "nbyzm"}On CPU, the response was:
{"most_likely_text": "nby2m", "less_likely_text": "nbytm"}(withtinstead ofz, but both responses are incorrect.)3) Generating the outputs using the KV cache
The KV cache temporarily stores part of a prompt and its response so that identical prompts do not have to be regenerated by the model when submitting the same prompt twice. However, using the stored results can cause the
/generate endpoint to produce different results when processing the same inputs twice in quick succession.
The /embed endpoint is unaffected by this issue because use of the KV cache is explicitly set to false: 01aa788722/runner/llamarunner/runner.go (L703)
Relevant known issues related to the KV cache:
https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535
https://github.com/ggml-org/llama.cpp/issues/3014
To reproduce this issue:
Run Ollama within the docker image built in section 1) and using the additional flags
-e CUDA_VISIBLE_DEVICES=-1 (for CPU), and then run the Python code below ** twice.
First response:
{"most_likely_text": "nby2m", "less_likely_text": "nbytm"}
Second response:
{"most_likely_text": "n2m", "less_likely_text": "bg2m"}
As you can see, the first and second responses are completely different. Both responses are incorrect when compared to the letters and numbers contained in the image.
Why different results for identical inputs present a significant problem
Downstream applications like retrieval augmented generation (RAG), classification, annotation, or semantic search depend on consistent embedding, logit, and text generation. Those applications usually compare similarity between embeddings, so any uncertainty in embeddings reduces the quality of results that those applications can produce. For large datasets, similarities between embeddings often fall within a relatively small range, so even small differences in embeddings can make those applications unusable.
Possible workarounds to generate consistent outputs for the same inputs (to be confirmed):
Disable use of the KV cache on the /generate endpoint for the llama runner (01aa788722/runner/llamarunner/runner.go (L611)) or for the ollama runner (01aa788722/runner/ollamarunner/runner.go (L600)).
** Code used to investigate all three causes of different results for the same inputs:
image “lettersandnumbers.jpg”:
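The collapsed script is not reproduced in this mirror. Purely as a hypothetical sketch of the kind of request the responses above imply (a vision model, the attached image, and JSON-formatted output; the model name, prompt, and options are placeholders, not the original code):

```python
import base64
import requests  # assumed to be installed

# Hypothetical sketch only; not the collapsed script from the original comment.
with open("lettersandnumbers.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

body = {
    "model": "gemma3:12b",
    "prompt": ("Read the characters in the image and answer as JSON with the "
               "keys most_likely_text and less_likely_text."),
    "images": [image_b64],
    "format": "json",
    "stream": False,
    "options": {"temperature": 0, "seed": 0},
}
resp = requests.post("http://localhost:11434/api/generate", json=body)
print(resp.json()["response"])
```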
@flexorx commented on GitHub (Mar 30, 2025):
I can't get it, @kevin-pw. In our case we are running this on the same OS (RedHat), on the same computer, in the same environment, without any Docker, without parallelism, and purely on GPU. We do NOT do anything explicit or special regarding the KV cache; we just set temperature, top_p, top_k and the rest of this stuff to 0 and repetition_penalty to 1.0 to remove ANY non-determinism and ensure results are fully reproducible. What is the cause in this case? The whole discussion has taken on a "there's nothing we can do" sentiment, but this was obviously not an issue previously under our conditions and only appeared sometime in 2024 as a kind of bug.
@kevin-pw commented on GitHub (Mar 30, 2025):
@flexorx Does the following correctly describe your issue when you submit multiple identical input queries?
If that is the case, the issue is caused by the KV cache. In the current version
0.6.3 of Ollama, no input parameter is available to disable the KV cache on the /generate endpoint. To eliminate the issue, you would need to implement workaround 3) from my comment above by modifying the two lines in the source code.
@flexorx commented on GitHub (Mar 30, 2025):
@kevin-pw Yes, indeed, the result of the first query is often different from the second and subsequent queries for the same input. Additionally, if some other input is sent before the first submission of a specific query, then the first result for that query differs from the first result we get for the same query without that other input preceding it.
So, essentially, we can say two things:
The result of the first submission of a particular query always differs from the second and subsequent submissions of that same query, provided that everything preceding that first submission stays the same and that no other queries are interleaved between the first, second, and subsequent submissions of that query.
In general, if we perform a sequence of queries A, B, C, D, the results vary for any permutation of the order, such as B, C, A, D or D, A, C, B. The results are therefore not invariant under permutation of the queries; they are path-dependent.
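A sketch of the permutation check being described: run the same set of prompts in two different orders and compare each prompt's answer across the two runs (model, prompts, and options are placeholders):

```python
import requests  # assumed to be installed

URL = "http://localhost:11434/api/generate"
OPTIONS = {"temperature": 0, "seed": 0, "top_k": 0, "top_p": 0, "repeat_penalty": 1.0}
PROMPTS = {"A": "Define entropy.", "B": "Define enthalpy.",
           "C": "Define free energy.", "D": "Define heat capacity."}

def run(order: str) -> dict:
    results = {}
    for name in order:
        body = {"model": "mistral-small:24b", "prompt": PROMPTS[name],
                "stream": False, "options": OPTIONS}
        results[name] = requests.post(URL, json=body).json()["response"]
    return results

first, second = run("ABCD"), run("BCAD")
for name in PROMPTS:
    # Path dependence, as described above: the answer to the same prompt can
    # depend on which prompts were served before it.
    print(name, first[name] == second[name])
```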
@JakeBeaver commented on GitHub (Mar 31, 2025):
I found some simple repro steps for the /api/chat endpoint. Repeating this gives the same result in a loop. After the first response, message.content loses the "Hello!" and eval_count changes from 18 to 16 for all subsequent responses.
At least it's enough to unload with an API request, so I don't have to rely on shell scripts rebooting all of Ollama, but still, it seems wasteful to keep keep_alive at 0 and force the model to unload after every request.
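A sketch of the kind of loop being described, checking message.content and eval_count across repeated, identical /api/chat requests (model and options are placeholders):

```python
import requests  # assumed to be installed

URL = "http://localhost:11434/api/chat"
body = {
    "model": "llama3.2",  # placeholder model
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
    "options": {"temperature": 0, "seed": 0},
}

for i in range(3):
    reply = requests.post(URL, json=body).json()
    # Per the report above, message.content and eval_count change after the
    # first response (e.g. eval_count drops from 18 to 16).
    print(i, reply["eval_count"], repr(reply["message"]["content"]))
```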
@wyli commented on GitHub (Mar 31, 2025):
Disabling the KV cache as mentioned by @kevin-pw works for me. I put a possible implementation here to make it configurable: https://github.com/ollama/ollama/pull/10064
@d-kleine commented on GitHub (Apr 4, 2025):
@sayap already identified the KV cache as the root of the problem with generating consistent reproducible outputs almost a year ago. He also submitted a fix which worked at least when I tested it back then (#5760), but this PR has not been merged.
I have also tried to make the PRNG for the cache initialization consistent with a seed, but never got it really working. Disabling the KV cache forces the LLM to fully recompute the attention matrices, therefore using more memory than having the KV cache enabled.
@wyli commented on GitHub (Apr 4, 2025):
Not sure why that PR was not considered as well... in general these changes don't alter the default and increase flexibility.
(I think the analysis here has already demonstrated the numerical differences: https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535)
@Jonas-Wessner commented on GitHub (Feb 25, 2026):
I confirm the bug on Ubuntu 24.04.3, running on L40S GPUs.
When I execute the script for the first time, I get a random output for iteration 0 of the loop. For subsequent iterations, I get a different, but consistent output.
If I run the script again, the bug is gone.
If I change something about the prompt (causing some cache reload I suppose), the bug can be reproduced again.
I hope this can be fixed soon, since otherwise it is hard to guarantee reproducible experiment results.