[GH-ISSUE #5495] The quality of the results returned by the embedding model has become worse #65475

Closed
opened 2026-05-03 21:26:10 -05:00 by GiteaMirror · 8 comments

Originally created by @wwjCMP on GitHub (Jul 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5495

What is the issue?

The quality of the results returned by the embedding model is now much worse than in the previous version.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.48


GiteaMirror added the "bug" and "needs more info" labels 2026-05-03 21:26:11 -05:00

@savvyer commented on GitHub (Jul 5, 2024):

Yep, I noticed as well that ollama spits out nonsense responses for different models.

@dcasota commented on GitHub (Jul 6, 2024):

Until which version was the quality good? Which embedding model were you using? Did you use langchain-python-rag-privategpt?

@mitar commented on GitHub (Jul 20, 2024):

We also noticed a degradation of quality between 0.1.38 and 0.2.1. We ran a comparison of versions between those two against our evaluation test cases and discovered:

  • Between v0.1.42 and v0.1.43 we detected the first degradation.
  • Between v0.1.44 and v0.1.45 we detected the second degradation.

Reference: https://gitlab.com/peerdb/llm/-/merge_requests/34
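
For anyone wanting to reproduce this kind of bisection, a rough sketch of the procedure follows. The tagged `ollama/ollama` Docker images, the shared model volume, the model name, and the single-prompt eval are all assumptions on my part, not details from the linked MR:

```python
# Hypothetical bisection harness: replay the same deterministic prompt
# against several tagged Ollama builds and report where the output changes.
# Assumes the model has already been pulled into the shared volume
# (e.g. via `docker exec <cid> ollama pull <model>` on the first run).
import subprocess
import time

import requests

VERSIONS = ["0.1.42", "0.1.43", "0.1.44", "0.1.45"]
PORT = 11500

def start(version: str) -> str:
    """Start a tagged Ollama build; share a volume so models stay pulled."""
    cid = subprocess.check_output(
        ["docker", "run", "-d", "-p", f"{PORT}:11434",
         "-v", "ollama-bisect:/root/.ollama", f"ollama/ollama:{version}"],
        text=True).strip()
    time.sleep(10)  # crude wait for the server to come up
    return cid

def run_eval() -> str:
    """One illustrative test case with deterministic settings."""
    r = requests.post(f"http://127.0.0.1:{PORT}/api/generate", json={
        "model": "llama3:70b-instruct-q4_0",
        "prompt": "Parse 'flour, sugar, salt' into a JSON list of ingredients.",
        "stream": False,
        "options": {"seed": 42, "temperature": 0},
    })
    r.raise_for_status()
    return r.json()["response"]

outputs = {}
for v in VERSIONS:
    cid = start(v)
    try:
        outputs[v] = run_eval()
    finally:
        subprocess.run(["docker", "rm", "-f", cid], check=True)

for prev, cur in zip(VERSIONS, VERSIONS[1:]):
    status = "CHANGED" if outputs[prev] != outputs[cur] else "same"
    print(f"{prev} -> {cur}: {status}")
```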

@dhiltgen commented on GitHub (Oct 24, 2024):

Are you still seeing poorer results in the latest version of Ollama, or has it recovered?

If it's still poor, can you clarify how you're arriving at this conclusion? Do you have a simple repro scenario we can run against old/new Ollama versions to see the same behavior?

@MarkoSagadin commented on GitHub (Nov 6, 2024):

Hello @dhiltgen, I worked with @mitar on a project where we evaluated how well different LLM models parse unstructured information (descriptions of food ingredients on packaging) into structured information (JSON format). The MR linked above contains the report of one such evaluation.

By degradation we meant that, with the same model, the same content, the same fixed seed, and temperature set to 0, the `api/chat/` endpoint returns a different response across different Ollama versions.
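
(Editorial aside: a minimal determinism probe along these lines might look like the sketch below; the model, port, and prompt are illustrative, not code from the linked MR.)

```python
# Hypothetical sketch: call /api/chat twice with a fixed seed and
# temperature 0, then compare the replies byte for byte. Pointing the two
# calls at servers running different Ollama versions turns this into a
# cross-version check.
import requests

def chat_once(base_url: str) -> str:
    r = requests.post(f"{base_url}/api/chat", json={
        "model": "llama3:70b-instruct-q4_0",
        "messages": [{"role": "user",
                      "content": "Parse 'flour, sugar, salt' into a JSON list."}],
        "stream": False,
        "options": {"seed": 42, "temperature": 0},
    })
    r.raise_for_status()
    return r.json()["message"]["content"]

a = chat_once("http://127.0.0.1:11434")
b = chat_once("http://127.0.0.1:11434")
assert a == b, "same version, same seed: expected identical output"
```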

For example, take a look at [this report](https://gitlab.com/peerdb/food/-/blob/9fadf2a29919cf470da65ba578815aff62e2ce1a/llm-tester/output/2024-07-18_20:40:29/report.md) that we generated for our dataset while testing Ollama versions 0.1.38, 0.2.1, 0.2.5, and 0.2.6.

At the end of the report you can see the diffs between expected and actual outputs (they are listed in the same way as I listed them here). So versions 0.2.1, 0.2.5, and 0.2.6 produced the same output, while 0.1.38 produced a different one.

You can find the relevant test files in [this directory](https://gitlab.com/peerdb/food/-/tree/9fadf2a29919cf470da65ba578815aff62e2ce1a/llm-tester/output/2024-07-18_20%3A40%3A29). The train and validate folders inside the timestamped folders contain the inputs (.txt files) and the expected outputs (.json files). Each `llama3:70b-instruct-q4_0` folder contains the expected output.


I have tested the same inputs against the latest 0.3.14 version; the outputs are the same as in v0.1.45.

So, from our point of view, we marked non-deterministic output (when we expected it to be deterministic) as a degradation.

Would it make sense, from the Ollama project's point of view, to add tests to the CI which check whether the responses are deterministic (given the conditions described above)?

@mitar commented on GitHub (Dec 8, 2024):

I would like to expand on what @MarkoSagadin wrote: it is not just that outputs differ between Ollama versions, but also that outputs from newer Ollama versions got semantically worse (when inspected by a human) than those from version 0.1.38. So for a particular task and a set of different inputs we check whether the outputs are a) the same, and b) if not, whether they are still what a human would call semantically correct. And 0.1.38 had the highest number of correct outputs (for a task of parsing a natural-language list of food ingredients into parsed JSON output), while for newer versions this number was lower.

So there were two such degradations: a) between v0.1.42 and v0.1.43, and b) between v0.1.44 and v0.1.45. Since v0.1.45 the outputs have been deterministic, up to version 0.3.14, which is the latest we have tested.

> Would it make sense, from the Ollama project's point of view, to add tests to the CI which check whether the responses are deterministic (given the conditions described above)?

I think this is a good suggestion. Ollama should have a set of prompts and expected outputs that it could run in CI to test whether everything is staying stable or, if not, to make it clear/known that something has changed.
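
(Editorial aside: a hedged sketch of what such a CI check could look like, assuming a pytest harness, a checked-in testdata tree of prompts and golden outputs, and an illustrative model; all of these names are hypothetical.)

```python
# Hypothetical golden-output test: replay each prompt with deterministic
# settings and compare the reply against a checked-in expected output.
from pathlib import Path

import pytest
import requests

PROMPTS = sorted(Path("testdata/prompts").glob("*.txt"))

@pytest.mark.parametrize("case", PROMPTS, ids=lambda p: p.stem)
def test_output_is_stable(case: Path):
    golden = (Path("testdata/golden") / case.name).read_text()
    r = requests.post("http://127.0.0.1:11434/api/chat", json={
        "model": "llama3:70b-instruct-q4_0",
        "messages": [{"role": "user", "content": case.read_text()}],
        "stream": False,
        "options": {"seed": 42, "temperature": 0},
    })
    r.raise_for_status()
    assert r.json()["message"]["content"] == golden
```

A failing test here would not necessarily mean quality regressed, only that something changed; the diff against the golden file is what a human would then judge, as described above.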

@EternityForest commented on GitHub (Mar 15, 2025):

I was wondering if I was going crazy or if the models seemed subjectively not that great!

@pdevine commented on GitHub (Oct 3, 2025):

I'm going to go ahead and close this as stale. There have been a lot of improvements with embedding models on the ollama engine (i.e. not the legacy llama.cpp engine) since this issue was originally filed.

I would definitely check out `embeddinggemma` with its MRL (matryoshka representation learning) implementation, or `qwen3-embedding`.
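
(Editorial aside: MRL-trained models front-load information, so a leading prefix of the vector, renormalized, remains a usable embedding. A hedged sketch against the newer /api/embed endpoint follows; the model name and the 256-dimension cutoff are illustrative.)

```python
# Hypothetical sketch of matryoshka truncation: take the full embedding,
# keep the leading prefix, and renormalize it, trading dimensions for
# storage and speed.
import math

import requests

r = requests.post("http://127.0.0.1:11434/api/embed",
                  json={"model": "embeddinggemma", "input": "hello world"})
r.raise_for_status()
full = r.json()["embeddings"][0]

def truncate(vec: list[float], dims: int) -> list[float]:
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]  # renormalize the kept prefix

v256 = truncate(full, 256)  # smaller vector, most of the quality retained
print(len(full), "->", len(v256))
```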
