[GH-ISSUE #12239] [Performance] embeddinggemma seems to be slow #70202

Open
opened 2026-05-04 20:38:48 -05:00 by GiteaMirror · 2 comments

Originally created by @runyournode on GitHub (Sep 10, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12239

Hi,

I ran a speed benchmark comparing the new embeddinggemma against bge-m3 on GPU, and embeddinggemma seems oddly slow.

My questions:

  • Is this expected or normal? Is there any hope it will/could be optimized?
  • Why aren't the quantized models always faster than the bf16 one?

200 embedding requests:

=== Global summary ===
embeddinggemma:300m_batched=True : 9.27 sec
embeddinggemma:300m-qat-q8_0_batched=True : 2.08 sec
embeddinggemma:300m-qat-q4_0_batched=True : 8.97 sec
bge-m3:567m_batched=True       : 1.42 sec
embeddinggemma:300m_batched=False : 35.30 sec
embeddinggemma:300m-qat-q8_0_batched=False : 28.08 sec
embeddinggemma:300m-qat-q4_0_batched=False : 34.96 sec
bge-m3:567m_batched=False      : 30.45 sec

1000 embedding requests (batched only):

=== Global summary ===
embeddinggemma:300m_batched=True : 9.79 sec
embeddinggemma:300m-qat-q8_0_batched=True : 10.06 sec
embeddinggemma:300m-qat-q4_0_batched=True : 17.10 sec
bge-m3:567m_batched=True       : 6.77 sec

Host & config
Ubuntu 22.04 - RTX 4090 - AMD Ryzen 9 5950X - 128GB RAM
ollama v0.11.10
Env:

  • Environment="OLLAMA_DEBUG=1"
  • Environment="OLLAMA_HOST=0.0.0.0"
  • Environment="OLLAMA_CONTEXT_LENGTH=8192"
  • Environment="OLLAMA_FLASH_ATTENTION=1"
  • Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
  • Environment="OLLAMA_NUM_PARALLEL=2"
  • Environment="OLLAMA_KEEP_ALIVE=900"
  • Environment="OLLAMA_NEW_ENGINE=1"

Script to reproduce the benchmark

from typing import List
import string
import random
import time

import requests


# Generate a list of random strings to embed
alphabet = string.ascii_letters + string.digits
random_strings = [
    ''.join(random.choices(alphabet, k=300))
    for _ in range(2_000)
]


def get_embeddings_batch(texts: List[str], batched: bool = True, model="embeddinggemma:300m", url="http://localhost:11434/api/embed"):
    """
    Get embeddings from Ollama.

    Args:
        texts (list[str]): List of text to embed
        batched (bool): single or fully batched query
        model (str): Ollama embedding model
        url (str): Ollama API url

    Returns:
        list[list[float]]: List of embeddings vectors
    """
    if batched:
        response = requests.post(url, json={"model": model, "input": texts})
        response.raise_for_status()
        data = response.json()
        embeddings = data["embeddings"]
    else:
        embeddings = []
        for text in texts:
            response = requests.post(url, json={"model": model, "input": text})
            response.raise_for_status()
            data = response.json()
            embeddings.append(data["embeddings"][0])

    return embeddings


if __name__ == "__main__":
    batch = random_strings

    models = [
        "embeddinggemma:300m",
        "embeddinggemma:300m-qat-q8_0",
        "embeddinggemma:300m-qat-q4_0",
        "bge-m3:567m"
    ]

    results = {}

    for batched in [True]:  # add False here to also benchmark the unbatched mode
        for model in models:
            _ = get_embeddings_batch(['load', 'and', 'warm-up!'], batched=batched, model=model)
            print(f"\n--- Testing model {model} with {batched=} ---")
            start = time.perf_counter()
            vecs = get_embeddings_batch(batch, batched=batched, model=model)
            end = time.perf_counter()
            elapsed = end - start
            results[f'{model}_{batched=}'] = elapsed
            print(f"Total time: {elapsed:.2f} sec")
            print(f"Number of embeddings: {len(vecs)} ; dimension: {len(vecs[0])}")

    print("\n=== Global summary ===")
    for model, t in results.items():
        print(f"{model:<30} : {t:.2f} sec")
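
A side note on the batched=False timings above: the loop in the script issues requests strictly one at a time, so with OLLAMA_NUM_PARALLEL=2 the server never actually receives two requests at once. Below is a minimal sketch of a concurrent variant, assuming the same /api/embed endpoint as above (the worker count of 2 simply mirrors the OLLAMA_NUM_PARALLEL setting):

from concurrent.futures import ThreadPoolExecutor
from typing import List

import requests


def get_embedding(text: str, model: str = "embeddinggemma:300m",
                  url: str = "http://localhost:11434/api/embed") -> List[float]:
    """Fetch one embedding vector for a single input text."""
    response = requests.post(url, json={"model": model, "input": text})
    response.raise_for_status()
    return response.json()["embeddings"][0]


def get_embeddings_concurrent(texts: List[str], workers: int = 2) -> List[List[float]]:
    """Overlap up to `workers` single-text requests on the client side."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(get_embedding, texts))

This only changes the client; whether the in-flight requests are actually served in parallel is still up to the server's scheduler.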

@kevinsimper commented on GitHub (Oct 8, 2025):

I also tried benchmarking embeddinggemma, and it was not much faster with huge batches than with small ones, and the GPU is not busy. It's hard to know how to debug it :)
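
One way to put numbers on the "GPU is not busy" observation is to poll utilization while a batched request is in flight. A rough sketch, assuming nvidia-smi is on the PATH (the query flags below are standard nvidia-smi options):

import subprocess
import threading
import time

import requests


def poll_gpu(stop: threading.Event, interval: float = 0.2) -> None:
    """Print GPU utilization (%) via nvidia-smi until `stop` is set."""
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        print(f"GPU util: {out.stdout.strip()}%")
        time.sleep(interval)


stop = threading.Event()
poller = threading.Thread(target=poll_gpu, args=(stop,))
poller.start()
requests.post("http://localhost:11434/api/embed",
              json={"model": "embeddinggemma:300m", "input": ["hello"] * 500})
stop.set()
poller.join()

If utilization stays near zero for the whole request, the time is being spent somewhere other than the GPU kernels (tokenization, request scheduling, or a CPU fallback).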


@ww2283 commented on GitHub (Nov 3, 2025):

It is slow. I switched the model from embeddinggemma to nomic-embed-text, which reveals the speed difference. Either this model is not yet optimized for Ollama, or there is some bug here. This is the case on Ubuntu 20.04. On Mac, the speed of embeddinggemma is OK, though.
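
For a quick A/B along the lines of this comment, the get_embeddings_batch helper from the script above can be reused with nomic-embed-text swapped in (a sketch; it assumes both models are already pulled, and timings will of course vary by hardware):

import time

# Assumes get_embeddings_batch and random_strings from the script above.
for model in ["embeddinggemma:300m", "nomic-embed-text"]:
    _ = get_embeddings_batch(["warm-up"], model=model)  # load the model first
    start = time.perf_counter()
    _ = get_embeddings_batch(random_strings[:500], model=model)
    print(f"{model}: {time.perf_counter() - start:.2f} sec")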

Reference: github-starred/ollama#70202