[GH-ISSUE #7400] Creating embeddings using the REST API is much slower than performing the same operation using Sentence Transformers #4706

Open
opened 2026-04-12 15:38:52 -05:00 by GiteaMirror · 12 comments

Originally created by @sebovzeoueb on GitHub (Oct 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7400

Originally assigned to: @jessegross on GitHub.

What is the issue?

I'm working on a RAG application written in Python, and we're using ollama as the chatbot LLM provider. It runs in a Docker container, and the Python app makes REST API calls to it. So far we have been using Sentence Transformers to create embeddings for documents that get ingested into the RAG and for the user's query; however, it would be great to drop this dependency, as it adds a bit of startup time and a lot of package dependencies that take up disk space.

Since the embed API now supports batching, I ran a test comparing the existing Sentence Transformers code against equivalent code (same vectors, same model) using the embed route of the ollama Docker container, and the latter is about 2x slower. Making an HTTP request adds some overhead, sure, but I can't imagine it should cost that much.

Is there any way we can make it faster, so we can get rid of the other dependencies and use ollama for all our LLM-related needs?
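For context, a batched embedding request is a single POST to the embed route. A minimal sketch, assuming a local server on the default port with the `all-minilm` model already pulled:

```python
import requests

# Minimal batched call to the /api/embed route. Assumes a local Ollama
# server on the default port and that the all-minilm model is present.
resp = requests.post(
    "http://localhost:11434/api/embed",
    json={"model": "all-minilm", "input": ["first chunk", "second chunk"]},
)
resp.raise_for_status()
embeddings = resp.json()["embeddings"]  # one vector per input string
print(len(embeddings), len(embeddings[0]))
```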

OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.3.14

GiteaMirror added the performance and bug labels 2026-04-12 15:38:52 -05:00

@rick-github commented on GitHub (Oct 28, 2024):

Which model? [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md) may give some insight into why it's slower. Would it be possible for you to provide a script that demonstrates the problem (i.e., one that does both types of embedding)?


@sebovzeoueb commented on GitHub (Oct 28, 2024):

OK, this is performing worse than I expected:

```python
from sentence_transformers import SentenceTransformer
import json
import requests
import os
import time

stransform = SentenceTransformer("paraphrase-MiniLM-L6-v2")
HOST = os.getenv("OLLAMA_HOST", "localhost")

def create_st_embeddings(text):
    return stransform.encode(text)



def load_model(model_name):
    def is_loaded():
        models = requests.get(f"http://{HOST}:11434/api/tags")
        model_list = json.loads(models.text)["models"]
        return next(
            filter(lambda x: x["name"].split(":")[0] == model_name, model_list), None
        )

    while not is_loaded():
        print(f"{model_name} model not found. Please wait while it loads.")
        request = requests.post(
            f"http://{HOST}:11434/api/pull",
            data=json.dumps({"name": model_name}),
            stream=True,
        )
        current = 0
        for item in request.iter_lines():
            if item:
                value = json.loads(item)
                # TODO: display statuses
                if "total" in value:
                    if "completed" in value:
                        current = value["completed"]
                    yield (current, value["total"])


def create_embeddings_ollama(text):
    data = {"model": "all-minilm", "input": text, "stream": False}
    response = requests.post(
        f"http://{HOST}:11434/api/embed", data=json.dumps(data)
    ).json()
    # if there was an error in the response, it may be because the model wasn't present
    # TODO: check the type of error
    if "error" in response:
        for _ in load_model("all-minilm"):
            # TODO: display progress
            pass
        response = requests.post(
            f"http://{HOST}:11434/api/embed", data=json.dumps(data)
        ).json()
        # if at this point it still didn't work we'll let it raise an exception
    return response["embeddings"]

def test_speed():
    inputs = ["Lorem ipsum", "dolor sit amet,", "consectetur adipiscing elit,", "sed do", "eiusmod tempor incididunt ut labore et dolore magna aliqua"]
    start = time.time()
    for i in range(100):
        create_st_embeddings(inputs)
    end = time.time()
    print(f"Sentence tranformer took {end - start}s")
    start = time.time()
    for i in range(100):
        create_embeddings_ollama(inputs)
    end = time.time()
    print(f"Ollama took {end - start}s")

if __name__ == "__main__":
    test_speed()
```

On my system I get the following output:

```
Sentence transformer took 4.112587213516235s
Ollama took 72.928391456604s
```

I accounted for model loading, of course: the model was already loaded beforehand, so the loading branch was skipped. I ran the test twice and got almost exactly the same timings, so it's consistent.

My server logs are full of entries like this, so the math checks out: 100 requests at roughly 700 ms each.

```
2024-10-28 18:42:39 [GIN] 2024/10/28 - 17:42:39 | 200 |  704.642746ms |      172.18.0.1 | POST     "/api/embed"
2024-10-28 18:42:40 [GIN] 2024/10/28 - 17:42:40 | 200 |  778.619764ms |      172.18.0.1 | POST     "/api/embed"
2024-10-28 18:42:41 [GIN] 2024/10/28 - 17:42:41 | 200 |  743.552632ms |      172.18.0.1 | POST     "/api/embed"
2024-10-28 18:42:41 [GIN] 2024/10/28 - 17:42:41 | 200 |  773.902971ms |      172.18.0.1 | POST     "/api/embed"
2024-10-28 18:42:42 [GIN] 2024/10/28 - 17:42:42 | 200 |   717.96729ms |      172.18.0.1 | POST     "/api/embed"
```

@rick-github commented on GitHub (Oct 28, 2024):

Thanks for the great test script. Testing locally:

```console
$ python3 7400.py
Sentence transformer took 0.6625242233276367s
Ollama took 3.005030870437622s
```

About 4.5x slower, nothing like the 17x you see. If I push the model onto CPU, it takes about twice as long, but even in the best case ollama is still slower. What's the output of `ollama ps` after the test?


@sebovzeoueb commented on GitHub (Oct 28, 2024):

My hardware is probably significantly worse than yours; I'm using a GTX 1070 and an i7-6700.
This is what my `ollama ps` output looks like:

```
NAME                 ID              SIZE      PROCESSOR    UNTIL
all-minilm:latest    1b226e2802db    528 MB    100% GPU     Forever
```

@sebovzeoueb commented on GitHub (Oct 28, 2024):

By the way, my host system is Windows, so Sentence Transformers is running natively on Windows with GPU acceleration, whereas ollama is running in Docker, also with GPU acceleration.


@rick-github commented on GitHub (Oct 28, 2024):

I had a quick look at the internal traffic and ollama turns each of the items in the batch list into a separate API call to the ollama runner, so there's a fan-out effect which I'm guessing is not a factor for SentenceTransformer. This may improve with the move to go-based runners in the future, but I'm afraid that at the moment, ollama is just going to be a lot slower than SentenceTransformer.
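A rough way to check for this fan-out on a given server is to time one batched request against the equivalent per-item requests: if each batch item becomes a separate runner call, the two timings will be close. A sketch, assuming a local server and the `all-minilm` model:

```python
import time
import requests

HOST = "localhost"  # assumed local Ollama server
INPUTS = ["Lorem ipsum dolor sit amet"] * 20

def embed(batch):
    # 'input' accepts either a single string or a list of strings.
    r = requests.post(
        f"http://{HOST}:11434/api/embed",
        json={"model": "all-minilm", "input": batch},
    )
    r.raise_for_status()
    return r.json()["embeddings"]

start = time.time()
embed(INPUTS)  # one request carrying the whole batch
print(f"batched:  {time.time() - start:.3f}s")

start = time.time()
for item in INPUTS:  # one request per item
    embed(item)
print(f"per-item: {time.time() - start:.3f}s")
```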


@sebovzeoueb commented on GitHub (Oct 28, 2024):

Oh, that's a shame, I hope that future improvement happens, because it's currently not great.


@dhiltgen commented on GitHub (Oct 28, 2024):

Due to current limitations in the underlying C++ code, embedding models have to process inputs serially, so while our embedding API does accept batches, each batch gets processed in a single thread on the underlying hardware. As we transition over to the Go server, we'll be looking at this and other parts of the system to improve parallelism for better performance.
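In the meantime, one possible client-side mitigation is to split large input lists into chunks and issue the requests from a small thread pool. This is only a sketch, and it assumes the server is configured to accept concurrent requests (e.g. via `OLLAMA_NUM_PARALLEL`); if the runner still processes embeddings serially, it mostly overlaps HTTP overhead rather than compute time:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

HOST = "localhost"  # assumed local Ollama server

def embed_batch(batch):
    r = requests.post(
        f"http://{HOST}:11434/api/embed",
        json={"model": "all-minilm", "input": batch},
    )
    r.raise_for_status()
    return r.json()["embeddings"]

def embed_chunked(texts, chunk_size=16, workers=4):
    # Split the inputs into chunks and send the requests concurrently.
    # Any speedup beyond hiding HTTP latency depends on how much
    # parallelism the server itself allows.
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, chunks)
    return [vec for chunk in results for vec in chunk]
```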


@liuy commented on GitHub (Oct 30, 2024):

> Due to current limitations in the underlying C++ code, embedding models have to process inputs serially, so while our embedding API does accept batches, each batch gets processed in a single thread on the underlying hardware. As we transition over to the Go server, we'll be looking at this and other parts of the system to improve parallelism for better performance.

I checked the code; the performance problem has nothing to do with the embedding processing itself, but rather with how the embed endpoint is implemented. For its internal truncation logic it loads kvData and tokenizes and detokenizes every string in the prompt array, which is quite inefficient. I don't think this is necessary, because the underlying Go runner already truncates the prompt if it exceeds the context window.

The fix could be simple:

1. Just remove the truncation logic from the embed handler and log a warning about truncation when the prompt size exceeds the context window.

Or, a little more complex:

2. Add a truncate flag to the embedding request; when the prompt size exceeds the context window, return an error rather than truncating the prompt.

I'm going to cook up a patch for option (2). What do you think? @dhiltgen @rick-github
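For illustration, this is roughly what option (2) could look like from the client side, assuming the flag is named `truncate` as in the current `/api/embed` documentation:

```python
import requests

# Ask the server not to truncate, and surface an error instead of
# silently cutting the prompt. Assumes the flag is named "truncate",
# as in the current /api/embed docs.
resp = requests.post(
    "http://localhost:11434/api/embed",
    json={
        "model": "all-minilm",
        "input": ["a prompt that might exceed the context window"],
        "truncate": False,
    },
)
body = resp.json()
if "error" in body:
    print("input exceeded the context window:", body["error"])
else:
    embeddings = body["embeddings"]
```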


@liuy commented on GitHub (Oct 30, 2024):

> Oh, that's a shame, I hope that future improvement happens, because it's currently not great.

It is indeed very slow when doing RAG work: the more documents in the RAG store, the slower it gets. Quite annoying, but easy to fix. I'll go and cook up a patch for it :)


@liuy commented on GitHub (Oct 30, 2024):

Hi there, I've made a patch for it: #7424 @dhiltgen @jessegross

The idea is to get the token counts in the runner instead of in the route handler.

Even on the following simple request, I got nearly a 20x boost:

```
curl http://localhost:11434/api/embed -d '{
  "model": "all-minilm",
  "input": ["Why is the sky blue?", "Why is the grass green?"]
}'
```

New approach: `"total_duration":14239148`
Old approach: `"total_duration":240871657`

(`total_duration` is reported in nanoseconds, so roughly 14 ms versus 241 ms.)


@davidshen84 commented on GitHub (Jul 22, 2025):

Hi, I think I'm experiencing the same issue. I have lots of small JSON documents that I want to create embeddings for. Although `nvidia-smi` shows the ollama process using the GPU, and `ollama ps` says the `nomic-embed-text` model is running 100% on GPU, the whole process is very slow, much slower than using the SentenceTransformer package.

On my Windows system, `nvidia-smi` says the GPU clock is maxed out, but GPU utilisation is close to 1%, and VRAM utilisation is about 300 MB, which is roughly the size of the embedding model.

I found these log messages on my ollama server:

```
[GIN] 2025/07/17 - 14:38:53 | 200 |    7.6638973s |       127.0.0.1 | POST     "/api/embed"
time=2025-07-17T14:38:54.026+10:00 level=WARN source=types.go:573 msg="invalid option provided" option=tfs_z
time=2025-07-17T14:38:54.026+10:00 level=WARN source=types.go:573 msg="invalid option provided" option=mirostat
time=2025-07-17T14:38:54.026+10:00 level=WARN source=types.go:573 msg="invalid option provided" option=mirostat_eta
time=2025-07-17T14:38:54.026+10:00 level=WARN source=types.go:573 msg="invalid option provided" option=mirostat_tau
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
decode: cannot decode batches with this context (use llama_encode() instead)
...the same decode message repeats endlessly...
```

```
> ollama --version
ollama version is 0.9.6
```

Is it confirmed that this is an issue on the ollama end?

Thanks
