ollama and llama-server servers generate different embedding output for the same GGUF model file #7515

Closed
opened 2025-11-12 14:09:21 -06:00 by GiteaMirror · 9 comments

Originally created by @khanakia on GitHub (Jul 11, 2025).

I am trying to generate embeddings for `Qwen3-Embedding-8B-Q4_K_M.gguf`.

I have confirmed all the parameters with verbose logging, and I also served the same GGUF file via ollama.

**Running llama**

```
llama-server --embeddings -m Qwen3-Embedding-8B-Q4_K_M.gguf --port 8085

curl --location 'http://127.0.0.1:8085/embedding' \
--header 'Content-Type: application/json' \
--data '{
    "content": "hello"
}'
```

**Running Ollama**

```
curl http://localhost:11434/api/embed -d '{
  "model": "hf.co/Qwen/Qwen3-Embedding-8B-GGUF:latest",
  "input":"hello"
}'
```

```
## ollama
[ 0.013413369, 0.010063744, -0.0026498106, -0.014395288, -0.008698153, -0.0011749634, ...... ]

## llama
[ 1.6623456478118896, 1.2472200393676758, -0.32839635014533997, -1.7840369939804077, -1.077979564666748, -0.1456155776977539, ...... ]
```

@rick-github commented on GitHub (Jul 11, 2025):

ollama [normalizes](https://github.com/ollama/ollama/blob/35fda7b4af556e7eeef2b5dcb3638435382b2576/server/routes.go#L506) the embeddings.

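For context, the normalization in question is a standard L2 (unit-length) scaling. Below is a minimal Go sketch of that operation, written for illustration rather than copied from ollama's source:

```go
package main

import (
	"fmt"
	"math"
)

// normalize scales a vector to unit (L2) length. This is a minimal
// re-implementation of the idea, not a copy of ollama's routes.go code.
func normalize(vec []float32) []float32 {
	var sum float64
	for _, v := range vec {
		sum += float64(v) * float64(v)
	}
	norm := math.Sqrt(sum)
	if norm == 0 {
		return vec
	}
	out := make([]float32, len(vec))
	for i, v := range vec {
		out[i] = float32(float64(v) / norm)
	}
	return out
}

func main() {
	// A 3-4-0 vector normalizes to [0.6 0.8 0].
	fmt.Println(normalize([]float32{3, 4, 0}))
}
```

Dividing every component by the vector's Euclidean length leaves its direction unchanged, which is why the ollama and llama-server outputs can differ only in scale.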

@khanakia commented on GitHub (Jul 11, 2025):

OK, I tried to normalize the llama output using the same ollama normalize func, but the values are still way off:

```go
func main() {
	v := normalize([]float32{1.6623456478118896, 1.2472200393676758, -0.32839635014533997}) // Example usage of normalize
	fmt.Println(v)                                                                          // Output the normalized vector
	// [0.7900901 0.59278655 -0.15608227]
}
```

@rick-github commented on GitHub (Jul 11, 2025):

What about the 4093 other points in the vector?


@khanakia commented on GitHub (Jul 11, 2025):

Thanks, it's correct.

I verified by normalizing the llama embedding output and comparing both:

https://github.com/khanakia/ollama_llama_difference/blob/main/vector_comparison_report.md

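For anyone who wants to reproduce the check without the linked report, here is a small Go sketch (not taken from that repo) that compares the raw llama-server vector against the already-normalized ollama vector using cosine similarity, which ignores scale:

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
// Scale differences (raw vs. normalized) do not affect the result.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Placeholder slices with the first few values from the issue;
	// in practice, paste the full 4096-dimension vectors from both servers.
	llama := []float64{1.6623456478118896, 1.2472200393676758, -0.32839635014533997}
	ollama := []float64{0.013413369, 0.010063744, -0.0026498106}

	// A value very close to 1.0 means the embeddings point in the same
	// direction and differ only by normalization.
	fmt.Println("cosine similarity:", cosine(llama, ollama))
}
```

With the full 4096-dimension vectors, a cosine similarity of roughly 1.0 confirms the two servers produce the same embedding up to normalization.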

@khanakia commented on GitHub (Jul 11, 2025):

Hi @rick-github,

I am also using a Python script and comparing the vectors after normalizing:

- for `sentence-transformers/all-MiniLM-L6-v2`, the vectors matched 100%

But for `Qwen/Qwen3-Embedding-8B` there is a **total difference** of 0.0190%.

Can 0.0190% be ignored as negligible?

```py
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
import numpy as np

# Sentences we want sentence embeddings for
sentences = ["hello"]

# Load model from HuggingFace Hub
# tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-8B")
model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-8B")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
token_embeddings = model_output[0]
attention_mask = encoded_input["attention_mask"]

# Mean pooling with attention mask
input_mask_expanded = (
    attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
)
sentence_embeddings = torch.sum(
    token_embeddings * input_mask_expanded, 1
) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
local_embedding = sentence_embeddings[0].numpy()
np.set_printoptions(precision=8, suppress=True)
print(local_embedding)
```


@rick-github commented on GitHub (Jul 11, 2025):

The transformers library will be using BF16 tensors for all-MiniLM-L6-v2 and Qwen3-Embedding-8B. The default version of all-minilm in the ollama library (I'm assuming this is what you are comparing against) is F16, so it's expected that the vectors match. The version of Qwen3-Embedding-8B you are using in ollama is q4_K_M, so it's not surprising there is some loss in precision.

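As a rough intuition for why quantization leaves a small but nonzero residual, here is a toy Go illustration that round-trips a few values through a simple 4-bit uniform quantizer. This is a deliberate simplification: real Q4_K quantizes model weights with per-block scales rather than the output embeddings, but it gives a feel for the rounding granularity involved:

```go
package main

import (
	"fmt"
	"math"
)

// quantize4bit maps each value to one of 16 evenly spaced levels spanning
// [-max, max] and back again, mimicking (very roughly) the information
// loss of a 4-bit format. Real Q4_K is considerably more sophisticated.
func quantize4bit(vals []float64) []float64 {
	var max float64
	for _, v := range vals {
		if a := math.Abs(v); a > max {
			max = a
		}
	}
	step := 2 * max / 15 // 16 levels across the range
	out := make([]float64, len(vals))
	for i, v := range vals {
		out[i] = math.Round(v/step) * step
	}
	return out
}

func main() {
	orig := []float64{0.013413369, 0.010063744, -0.0026498106, -0.014395288}
	quant := quantize4bit(orig)
	for i := range orig {
		fmt.Printf("%+.6f -> %+.6f (err %.2e)\n", orig[i], quant[i], quant[i]-orig[i])
	}
}
```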

@khanakia commented on GitHub (Jul 11, 2025):

Thanks.

How do I check what quantization ollama is using for the downloaded models? It says `unknown`:

```
➜  huggingface ollama show hf.co/Qwen/Qwen3-Embedding-8B-GGUF
  Model
    architecture        qwen3
    parameters          7.6B
    context length      40960
    embedding length    4096
    quantization        unknown

  Capabilities
    completion
    tools
    insert
```

@rick-github commented on GitHub (Jul 11, 2025):

Tensors are not uniformly of one type in a quantized model; normally there's a header entry in the file to indicate what sort of quantization was done. HF models are sometimes missing this header entry, since they include it in the model tag. However, looking at the tensors in the model will give an indication of the type of quantization:

```console
$ ollama show -v hf.co/Qwen/Qwen3-Embedding-8B-GGUF
...

  Tensors
    output.weight                Q6_K    [4096 151936]
    output_norm.weight           F32     [4096]
    token_embd.weight            Q4_K    [4096 151936]
    blk.0.attn_k.weight          Q4_K    [4096 1024]
    blk.0.attn_k_norm.weight     F32     [128]
    blk.0.attn_norm.weight       F32     [4096]
    blk.0.attn_output.weight     Q4_K    [4096 4096]
...
```

Here we see that the large tensors in the repeating layers (blk.*) are of type Q4_K, so this is a q4 quantized model.

You can fetch the FP16 version of this model from HF:

```console
$ ollama pull hf.co/Qwen/Qwen3-Embedding-8B-GGUF:f16
```

which presumably will result in a smaller difference from the transformers output.


@khanakia commented on GitHub (Jul 12, 2025):

@rick-github, are you on Twitter? If so, I’d love to follow you!
