[GH-ISSUE #6262] Batch embeddings get progressively worse with larger batches #50431

Open
opened 2026-04-28 15:49:56 -05:00 by GiteaMirror · 30 comments
Owner

Originally created by @jorgetrejo36 on GitHub (Aug 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6262

What is the issue?

I am using the ollama Python library for all of the results below.

As I create embeddings using ollama.embed(), I get progressively worse embeddings as the batches get larger, compared against creating the embeddings one at a time. There seems to be a jump at batch sizes of 16 or larger. All of my tests assume that I am getting the embeddings back in the same order as the inputs, since I submitted an issue about that not too long ago which was resolved (#6187).

It is imperative that these embeddings be accurate, given that I am using them for a RAG app and retrieval needs to be good for every inserted embedding.

I ran function "chunk_text" with text from peter pan (https://www.gutenberg.org/files/16/16-h/16-h.htm), chunk_size = 256, max_characters of 65536 (256 chunks with 256 characters each).

I ran the function "test" with chunks from the above function call and a batch_size_list of [2, 4, 8, 16, 32, 64, 128, 256].

Below the code are the results and some plots of them.

import ollama
import numpy as np
import os
from typing import List
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

load_dotenv()

# Embedding model used was "bge-large:latest"
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")
EPS=1e-4

def chunk_text(text: str, chunk_size: int, max_characters: int) -> List[str]:
    chunks = []
    for i in range(0, len(text) if len(text) < max_characters else max_characters, chunk_size):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)
    return chunks

# Used first few chapters of Peter Pan (the text must be loaded here;
# see the file-loading patch in the comments below)
text = ""

chunk_size = 256
# 256 is the max batch size that is defined later
chunks = chunk_text(text, chunk_size, chunk_size * 256)

def embed_string(s: str) -> np.ndarray:
    # Embed a single string; the API returns a list of embeddings, so take the first.
    return np.array(ollama.embed(
        input=s,
        model=EMBEDDING_MODEL,
        options={},
        truncate=False
    )["embeddings"])[0]

def embed_list(s: List[str]) -> np.ndarray:
    # Embed a whole batch of strings with a single ollama.embed() call.
    return np.array(ollama.embed(
        input=s,
        model=EMBEDDING_MODEL,
        options={},
        truncate=False
    )["embeddings"])

def test(list_of_string: List[str], batch_sizes: List[int]) -> tuple:
    avg_distances = []
    avg_similarites = []

    max_distances = []
    min_similarities = []

    for batch_size in batch_sizes:
        print(f"Results for batch size: {batch_size}")
        singles = np.array([embed_string(s) for s in list_of_string[:batch_size]])
        as_list = embed_list(list_of_string[:batch_size])
    
        distances = []
        for single_embedding, as_list_embedding in zip(singles, as_list):
            distance = np.sqrt(((single_embedding - as_list_embedding) ** 2).sum())
            distances.append(distance)

        distances = np.array(distances)

        mean = np.mean(distances)
        max = np.max(distances)

        avg_distances.append(mean)
        max_distances.append(max)

        print("Euclidean Distance:")
        print(f"\tMean of euclidean distances: {mean}")
        print(f"\tMax euclidean distance: {max}")
        
        # Cosine similarity
        similarities = []
        for single_embedding, as_list_embedding in zip(singles, as_list):
            vector1 = single_embedding.reshape(1, -1)
            vector2 = as_list_embedding.reshape(1, -1)
            similarity = cosine_similarity(vector1, vector2)
            similarities.append(similarity)


        similarities = np.array(similarities)

        mean = np.mean(similarities)  
        min = np.min(similarities)

        avg_similarites.append(mean)
        min_similarities.append(min)

        print("Cosine Similarity:")
        print(f"\tMean of cosine similarites: {mean}")
        print(f"\tMin cosine similarity: {min}")

        print("==========================================================")

    return (batch_sizes, avg_distances, avg_similarites, max_distances, min_similarities)

batch_sizes_list = [2**i for i in range(1, 9)]
batch_sizes, avg_distances, avg_similarities, max_distances, min_similarities = test(chunks, batch_sizes_list)

RESULTS:

Results for batch size: 2
Euclidean Distance:
	Mean of euclidean distances: 0.0027100650691554615
	Max euclidean distance: 0.003069791207141852
Cosine Similarity:
	Mean of cosine similarites: 0.999996263071194
	Min cosine similarity: 0.99999528818776
==========================================================
Results for batch size: 4
Euclidean Distance:
	Mean of euclidean distances: 0.002698965850379388
	Max euclidean distance: 0.0032083663351101474
Cosine Similarity:
	Mean of cosine similarites: 0.9999962901587796
	Min cosine similarity: 0.9999948531925777
==========================================================
Results for batch size: 8
Euclidean Distance:
	Mean of euclidean distances: 0.003292175370207458
	Max euclidean distance: 0.0038060000679778546
Cosine Similarity:
	Mean of cosine similarites: 0.9999945197318343
	Min cosine similarity: 0.999992757181494
==========================================================
Results for batch size: 16
Euclidean Distance:
	Mean of euclidean distances: 0.11461230989305338
	Max euclidean distance: 1.136748198080119
Cosine Similarity:
	Mean of cosine similarites: 0.946342128810411
	Min cosine similarity: 0.35390177365096614
==========================================================
Results for batch size: 32
Euclidean Distance:
	Mean of euclidean distances: 0.08102131219835153
	Max euclidean distance: 0.8772319282635773
Cosine Similarity:
	Mean of cosine similarites: 0.9674320902167836
	Min cosine similarity: 0.6152323203539565
==========================================================
Results for batch size: 64
Euclidean Distance:
	Mean of euclidean distances: 0.09294858544026222
	Max euclidean distance: 1.1095371609913112
Cosine Similarity:
	Mean of cosine similarites: 0.960566610535093
	Min cosine similarity: 0.38446375093677954
==========================================================
Results for batch size: 128
Euclidean Distance:
	Mean of euclidean distances: 0.08298749768139443
	Max euclidean distance: 0.9481241092937398
Cosine Similarity:
	Mean of cosine similarites: 0.9696922749059266
	Min cosine similarity: 0.5505303906957912
==========================================================
Results for batch size: 256
Euclidean Distance:
	Mean of euclidean distances: 0.08726932907397295
	Max euclidean distance: 1.0992560821737951
Cosine Similarity:
	Mean of cosine similarites: 0.966701897336651
	Min cosine similarity: 0.39581805237428414
==========================================================

![batch_size_vs_max_euclidean_distance](https://github.com/user-attachments/assets/d82d7e56-44fd-4066-b697-dcf158a23fa8)

![batch_size_vs_min_cosine_similarity](https://github.com/user-attachments/assets/489a45c5-ed08-4b98-bece-9512ed27d7b4)

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.3.4

GiteaMirror added the bug label 2026-04-28 15:49:56 -05:00
Author
Owner

@igorschlum commented on GitHub (Aug 8, 2024):

Hi @jorgetrejo36, I would like to run your code to see if I can replicate the issue on macOS, but some pieces are missing. Can you provide them?

Author
Owner

@jmorganca commented on GitHub (Aug 8, 2024):

Sorry about this - looking into it

Author
Owner

@rick-github commented on GitHub (Aug 8, 2024):

Which model are you using? I was unable to replicate with

  "nomic-embed-text:latest",
  "paraphrase-multilingual:latest",
  "snowflake-arctic-embed:latest",
  "mxbai-embed-large:latest",
  "bge-large:latest",
  "all-minilm:l12",
Author
Owner

@jorgetrejo36 commented on GitHub (Aug 9, 2024):

I am using bge-large:latest

Author
Owner

@royjhan commented on GitHub (Aug 12, 2024):

@jorgetrejo36 is the problem still persisting?

Author
Owner

@jorgetrejo36 commented on GitHub (Aug 13, 2024):

@royjhan I just ran the tests again and got similar results. I revised the code above with the actual function calls I made, so you or anyone else should be able to copy the code, run it immediately, and see what I see. Does running it on your own machines result in zero discrepancies, or what does it look like? I'm just curious because maybe there's something wrong on my side that I'm not noticing.

Author
Owner

@rick-github commented on GitHub (Aug 13, 2024):

I did wget https://www.gutenberg.org/files/16/16-h/16-h.htm, made this change

--- 6262.py.orig	2024-08-13 18:24:05.560356363 +0200
+++ 6262.py	2024-08-13 18:24:36.903051485 +0200
@@ -21,6 +21,8 @@
 
 # Used first few chapters of Peter Pan
 text = ""
+with open("16-h.htm") as fd:
+  text = fd.read()
 
 chunk_size = 256
 # 256 is the max batch size that is defined later

ran this command:

EMBEDDING_MODEL=bge-large:latest python3 ./6262.py

and got these results:

Results for batch size: 2
Euclidean Distance:
	Mean of euclidean distances: 0.0030663777330302523
	Max euclidean distance: 0.0036879363736636315
Cosine Similarity:
	Mean of cosine similarites: 0.999995105494726
	Min cosine similarity: 0.9999931995599575
==========================================================
Results for batch size: 4
Euclidean Distance:
	Mean of euclidean distances: 0.0027463073727389655
	Max euclidean distance: 0.00358716464884748
Cosine Similarity:
	Mean of cosine similarites: 0.9999960895006781
	Min cosine similarity: 0.9999935661230478
==========================================================
Results for batch size: 8
Euclidean Distance:
	Mean of euclidean distances: 0.0025482074992666114
	Max euclidean distance: 0.00358716464884748
Cosine Similarity:
	Mean of cosine similarites: 0.999996656158536
	Min cosine similarity: 0.9999935661230478
==========================================================
Results for batch size: 16
Euclidean Distance:
	Mean of euclidean distances: 0.0024955809009619316
	Max euclidean distance: 0.0034347715901393936
Cosine Similarity:
	Mean of cosine similarites: 0.9999968216936727
	Min cosine similarity: 0.9999941011717459
==========================================================
Results for batch size: 32
Euclidean Distance:
	Mean of euclidean distances: 0.0024666019674360485
	Max euclidean distance: 0.0034347715901393936
Cosine Similarity:
	Mean of cosine similarites: 0.9999969092075001
	Min cosine similarity: 0.9999941011717459
==========================================================
Results for batch size: 64
Euclidean Distance:
	Mean of euclidean distances: 0.002441116828212457
	Max euclidean distance: 0.0034347715901393936
Cosine Similarity:
	Mean of cosine similarites: 0.9999969774363446
	Min cosine similarity: 0.9999941011717459
==========================================================
Results for batch size: 128
Euclidean Distance:
	Mean of euclidean distances: 0.0024016578407820323
	Max euclidean distance: 0.003220162033245072
Cosine Similarity:
	Mean of cosine similarites: 0.9999970830756011
	Min cosine similarity: 0.999994815278163
==========================================================
Results for batch size: 256
Euclidean Distance:
	Mean of euclidean distances: 0.002390881610128061
	Max euclidean distance: 0.00358716464884748
Cosine Similarity:
	Mean of cosine similarites: 0.9999971108082498
	Min cosine similarity: 0.9999935661230478
==========================================================
Author
Owner

@royjhan commented on GitHub (Aug 13, 2024):

@rick-github thank you for contributing here. @jorgetrejo36 how are you loading Peter Pan into text, and can you see if the above changes change anything for you? I'm still not able to recreate the issue.

Author
Owner

@jorgetrejo36 commented on GitHub (Aug 14, 2024):

@royjhan Did some more testing and it seems to be a problem with ollama's parallelism. We removed these two env vars:

OLLAMA_MAX_QUEUE=1000
OLLAMA_NUM_PARALLEL=100

and we got similar results to @rick-github.

Ideally, we would like to keep these env vars set, but until this is fixed a temporary workaround is simply to disable these settings. Let me know if you get anything similar to my initial results when these env vars are set.
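
For anyone wanting to apply that temporary workaround from a script, a minimal sketch (assuming the `ollama` binary is on PATH; it simply starts the server without those two variables) might look like:

```python
import os
import subprocess

# Temporary workaround per the discussion above: start `ollama serve`
# with the parallelism-related settings removed from the environment.
# Assumes the `ollama` binary is on PATH.
env = os.environ.copy()
env.pop("OLLAMA_NUM_PARALLEL", None)
env.pop("OLLAMA_MAX_QUEUE", None)

server = subprocess.Popen(["ollama", "serve"], env=env)
```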

Author
Owner

@rick-github commented on GitHub (Aug 14, 2024):

Can you post some server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues)? I'm curious as to the effect of OLLAMA_NUM_PARALLEL/OLLAMA_MAX_QUEUE on scheduling on your hardware.

Author
Owner

@jorgetrejo36 commented on GitHub (Aug 21, 2024):

@rick-github

server-logs.txt: https://github.com/user-attachments/files/16698360/server-logs.txt

I set OLLAMA_NUM_PARALLEL=8 and I still get the same issue. When I did a single test with OLLAMA_NUM_PARALLEL=4, everything seemed to work fine.

Author
Owner

@rick-github commented on GitHub (Aug 21, 2024):

Thanks, I also got poor results at OLLAMA_NUM_PARALLEL=8, so something to keep in mind when doing bulk embeddings.

Author
Owner

@jorgetrejo36 commented on GitHub (Sep 12, 2024):

@rick-github @royjhan Is there any possibility of this being fixed, or have there been any updates on it?

Author
Owner

@rick-github commented on GitHub (Sep 13, 2024):

I'm looking at issues with generation being bad with high OLLAMA_NUM_PARALLEL, so it's not just embeddings. But I currently don't have a fix, it's a work in progress.

Author
Owner

@gaileys commented on GitHub (Oct 11, 2024):

I can confirm I'm seeing similar problems with a batch size of 128. I have OLLAMA_NUM_PARALLEL=4. I haven't run the above code to test the extent of this problem, but I have moved from non-batch processing to batch and see this clearly in the results.

Author
Owner

@gaileys commented on GitHub (Oct 15, 2024):

My understanding is that the problem is that the order of the resultant embeddings is not the same as the order of the chunks supplied. The embeddings are correct but they are randomised.

Author
Owner

@gaileys commented on GitHub (Nov 5, 2024):

> My understanding is that the problem is that the order of the resultant embeddings is not the same as the order of the chunks supplied. The embeddings are correct but they are randomised.

OK, I have run some more tests and that isn't the case. It is as stated originally: larger batch sizes generate increasingly dispersed results. I'm using the embeddings for clustering and the results are unusable for me at batch sizes of 128.

Author
Owner

@KastanDay commented on GitHub (Mar 19, 2025):

@jorgetrejo36 @gaileys and @rick-github any suggestions on max stable OLLAMA_NUM_PARALLEL? I see OLLAMA_NUM_PARALLEL=128 is "unusable", but do you have perfect performance with OLLAMA_NUM_PARALLEL=64 or 32 or less?

Thanks for any guidance on the most performant + most stable config.

Author
Owner

@rick-github commented on GitHub (Mar 20, 2025):

ollama doesn't currently support parallel embeddings (as of 90ca84172). If you need parallelism, the only method at the moment is to run multiple ollama servers and run a proxy in front to present a unified interface (eg https://github.com/ollama/ollama/issues/8186#issuecomment-2560443545)

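As a rough illustration of that multi-server approach, a client-side round-robin sketch might look like the following (the two ports are hypothetical, and it assumes the ollama Python client's `host` parameter):

```python
import itertools
from typing import List

import numpy as np
import ollama

# Hypothetical setup: two independent ollama servers on different ports.
clients = [
    ollama.Client(host="http://localhost:11434"),
    ollama.Client(host="http://localhost:11435"),
]
rr = itertools.cycle(clients)

def embed_round_robin(texts: List[str], model: str = "bge-large:latest") -> np.ndarray:
    # Send each request to the next server in turn; each server still
    # processes its own requests serially.
    out = []
    for t in texts:
        client = next(rr)
        out.append(client.embed(input=t, model=model)["embeddings"][0])
    return np.array(out)
```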
Author
Owner

@tobiaswuerth commented on GitHub (Apr 6, 2025):

> ollama doesn't currently support parallel embeddings (as of 90ca841). If you need parallelism, the only method at the moment is to run multiple ollama servers and run a proxy in front to present a unified interface (eg #8186 (comment))

Looking forward to embedding batching, is there a release timeline for this feature?

Author
Owner

@rick-github commented on GitHub (Apr 7, 2025):

If I were to guess, I'd say this will be a feature of the new runner architecture. While it doesn't support embedding at the moment, the new architecture means ollama developers won't have the same limitations as with the llama.cpp backend. How long until it gets there is anybody's guess.

Author
Owner

@jorgetrejo36 commented on GitHub (Apr 7, 2025):

Just to clarify,

Does batch processing for embeddings work at all? If OLLAMA_NUM_PARALLEL is set to 1, will batch processing work, or does the current implementation depend on the parallelism?

@rick-github

Author
Owner

@rick-github commented on GitHub (Apr 7, 2025):

Batch processing works, just not in parallel. The contents of the batch are serialized, processed, and the embeddings are then returned to the client.

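For reference, a minimal sketch of what that client-side usage can look like when embedding a large corpus in sub-batches (the batch size of 8 here is an arbitrary illustration, not a recommendation from this thread):

```python
from typing import List

import numpy as np
import ollama

def embed_corpus(texts: List[str], model: str = "bge-large:latest",
                 batch_size: int = 8) -> np.ndarray:
    # Send the corpus in small sub-batches; each ollama.embed() call accepts a
    # list input and returns one embedding per element, in order. The server
    # processes each request's contents serially.
    embeddings = []
    for i in range(0, len(texts), batch_size):
        resp = ollama.embed(input=texts[i:i + batch_size], model=model)
        embeddings.extend(resp["embeddings"])
    return np.array(embeddings)
```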
Author
Owner

@tobiaswuerth commented on GitHub (Apr 7, 2025):

> Batch processing works, just not in parallel. The contents of the batch are serialized, processed, and the embeddings are then returned to the client.

Ah, okay! Then I misunderstood. Because on the website (https://ollama.com/blog/embedding-models) at the bottom it states:

> Coming soon: Batch embeddings: processing multiple input data prompts simultaneously

Author
Owner

@PaulCapestany commented on GitHub (Apr 8, 2025):

FWIW: unless I'm missing something, it appears that with a simple tweak, parallel processing of /embed requests can now work, including with batching. All I did was change numParallel = 1 to numParallel = 4 in server/sched.go, build a new ollama binary via go install ./..., and after running a simple shell script (https://gist.github.com/PaulCapestany/6d2c6f0d9bb7261ceedb736268ad6377) to sanity-check sequential vs parallel requests, it seems to work fine for me.

It looks like the problem originated in llama.cpp (https://github.com/ggml-org/llama.cpp/issues/6722), and the issue was then addressed by disabling parallel embedding generation in ollama (https://github.com/ollama/ollama/pull/6467), but it appears to me that it must have since been fixed in llama.cpp, as I am no longer getting server crashes.

Author
Owner

@rick-github commented on GitHub (Apr 8, 2025):

The problem is not server crashes, it's that embeddings done in parallel are corrupted.

$ for p in 1 2 3 4 ; do echo parallel=$p ; seq 0 9 | parallel -n0 -P$p curl -s localhost:11434/api/embed -d \''{"model":"snowflake-arctic-embed","input":"What is the weather like today?"}'\' | jq -c '.embeddings[0]|.[0:3] + ["..."] + .[-3:]' ; done
parallel=1
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
parallel=2
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
parallel=3
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.00024531296,-0.05799037,-0.0064362884,"...",-0.004672217,-0.016983667,-0.043087896]
[0.00024531296,-0.05799037,-0.0064362884,"...",-0.004672217,-0.016983667,-0.043087896]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.00024531296,-0.05799037,-0.0064362884,"...",-0.004672217,-0.016983667,-0.043087896]
[0.00024531296,-0.05799037,-0.0064362884,"...",-0.004672217,-0.016983667,-0.043087896]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
parallel=4
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.00024531296,-0.05799037,-0.0064362884,"...",-0.004672217,-0.016983667,-0.043087896]
[0.00024531296,-0.05799037,-0.0064362884,"...",-0.004672217,-0.016983667,-0.043087896]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
[0.0001809538,-0.05792252,-0.0065146764,"...",-0.004687789,-0.016948145,-0.043066468]
Author
Owner

@PaulCapestany commented on GitHub (Apr 9, 2025):

@rick-github thanks, clearly in my haste I was missing many things 😅

I tried @jorgetrejo36's python script (with your improvements) while running my version of ollama with parallelization enabled for embeddings (via export OLLAMA_NUM_PARALLEL=4; ollama serve) and, yeah, the consistency/similarity results were atrocious. Results were totally fine with OLLAMA_NUM_PARALLEL=1 though.

I was curious about how to optimize embedding throughput even with the OLLAMA_NUM_PARALLEL=1 limitation, via chunk-sizing and/or batch-sizing. In case anyone else is too, here's what I saw in my testing (repo here: https://github.com/PaulCapestany/embed-tests, caveat emptor):

![Image](https://github.com/user-attachments/assets/cf89f359-45bd-406f-b936-5a6f1999afc5)

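For anyone wanting to run a similar comparison locally, here is a rough timing sketch (the model name and batch sizes are placeholders, not taken from the linked repo):

```python
import time
from typing import List

import ollama

def time_batch_sizes(texts: List[str], batch_sizes: List[int],
                     model: str = "bge-large:latest") -> None:
    # Time how long it takes to embed the same texts at different client-side
    # batch sizes, with the server left at OLLAMA_NUM_PARALLEL=1.
    for bs in batch_sizes:
        start = time.perf_counter()
        for i in range(0, len(texts), bs):
            ollama.embed(input=texts[i:i + bs], model=model)
        elapsed = time.perf_counter() - start
        print(f"batch_size={bs}: {elapsed:.2f}s for {len(texts)} chunks")
```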
Author
Owner

@vinipy12 commented on GitHub (Nov 17, 2025):

Is this still a thing? Or can we finally do parallel embeddings?

Author
Owner

@rick-github commented on GitHub (Nov 17, 2025):

Embedding models still do not support parallelism. A client can make parallel calls, but they will be serialized for processing.

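For illustration, concurrent client-side calls such as the following sketch (assuming the ollama Python library's AsyncClient) are accepted in parallel but still processed one at a time by the server:

```python
import asyncio
from typing import List

from ollama import AsyncClient

async def embed_concurrently(texts: List[str],
                             model: str = "bge-large:latest") -> List[List[float]]:
    # Issue the requests concurrently from the client; the server queues and
    # serializes them, so this does not change the embedding results.
    client = AsyncClient()
    responses = await asyncio.gather(
        *(client.embed(input=t, model=model) for t in texts)
    )
    return [r["embeddings"][0] for r in responses]

# Example: asyncio.run(embed_concurrently(["hello", "world"]))
```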
Author
Owner

@vinipy12 commented on GitHub (Nov 17, 2025):

So, if I make multiple ollama.embeddings() calls through an asyncio semaphore, they will be queued and serialized, right?


Reference: github-starred/ollama#50431