[GH-ISSUE #8713] Race condition in api/embed when generating embeddings with multiple inputs #31409

Closed
opened 2026-04-22 11:50:19 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @dpereira on GitHub (Jan 30, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8713

What is the issue?

There seems to be a race condition between the processBatch/decode execution and the Server.embeddings block (https://github.com/ollama/ollama/blob/main/llama/runner/runner.go#L784) that generates the json response from the embedding buffer it receives as a result.

Calling the api/embed endpoint repeatedly should trigger the issue for one of the executions. The race condition happens between the inputs sent for a given request, as far as I've been able to verify, but I think it could happen between concurrent requests to the api/embed endpoint as well.

When the issue manifests itself, part of the json array in the embedding response for one of the inputs is overwritten with zeroes, which appears to be caused by a concurrent call to memset within the llama.cpp code while the other input is being processed.
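
For illustration only, here is a minimal, self-contained Go sketch of the same hazard -- not ollama's actual code -- in which one goroutine encodes a float32 slice to json while another zeroes the backing array it aliases. Running it with the race detector (go run -race) flags the conflict, and the printed output can show a partially zeroed tail, matching the corrupted responses described above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

func main() {
	// Shared backing buffer, standing in for the backend output buffer that
	// both the response encoder and the next decode step alias.
	buf := make([]float32, 8)
	for i := range buf {
		buf[i] = float32(i) + 0.5
	}

	var wg sync.WaitGroup
	wg.Add(2)

	// Stand-in for the embeddings handler encoding the json response.
	go func() {
		defer wg.Done()
		out, _ := json.Marshal(buf)
		fmt.Println(string(out))
	}()

	// Stand-in for the memset in ggml_backend_cpu_buffer_clear that runs
	// while the next input is being decoded.
	go func() {
		defer wg.Done()
		for i := range buf {
			buf[i] = 0
		}
	}()

	wg.Wait()
}
```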

Example payload to reproduce:

{"input": ["dove", "pigeon"], "model": "llama3.2:3b"}

I think the issue happens with any model, and the probability of the race condition seems to vary according to model size (and processing time?). Reproduced with these models:

  • llama3.1:8b
  • llama3.2:3b
  • deepseek-r1:14b
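
As a rough way to hammer the endpoint, the sketch below (not an official test; it assumes a local server on the default port 11434 and that the api/embed response carries an "embeddings" array of float arrays) repeatedly posts the payload above and flags any response whose embedding ends in a long run of zeros:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	payload := []byte(`{"input": ["dove", "pigeon"], "model": "llama3.2:3b"}`)

	for i := 0; i < 100; i++ {
		resp, err := http.Post("http://localhost:11434/api/embed",
			"application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}

		var body struct {
			Embeddings [][]float64 `json:"embeddings"`
		}
		err = json.NewDecoder(resp.Body).Decode(&body)
		resp.Body.Close()
		if err != nil {
			fmt.Println("decode failed:", err)
			return
		}

		// A long zeroed tail is the symptom described in this issue.
		for j, emb := range body.Embeddings {
			zeros := 0
			for k := len(emb) - 1; k >= 0 && emb[k] == 0; k-- {
				zeros++
			}
			if zeros > len(emb)/4 {
				fmt.Printf("iteration %d, input %d: %d of %d trailing values are zero\n",
					i, j, zeros, len(emb))
			}
		}
	}
}
```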

What did you expect to happen?

The json arrays returned from the api/embed call should not be partially overwritten with zero values.

Additional details:

I'm using a CPU backend and have not tested other hardware configurations in detail. The issue also happens on a Mac M1, but I'm not sure whether the code paths involved are the same as on the box I'm using right now (Intel Core Ultra 9 185H / Linux) -- I will only be able to check the Mac later and will update then.

The issue seems to happen only when input 1 and input 2 are processed in separate batches. This happens when processBatch processes input 1, returns, and is then called again by the run method to process input 2 separately. When both inputs are processed in the same batch, in a single Decode call, the issue does not manifest.

The flow that causes the issue starts with EmbedHandler making two calls to the embeddings endpoint in parallel (one goroutine per input). After that, the concurrent flows are roughly the ones below (RC marks the steps that happen in parallel/concurrently when the issue manifests):

Input 1: embeddings (add seq and wait) -> processBatch -> Decode -> ... -> RC: embeddings (resume and encode json response)
Input 2: embeddings (add seq and wait) -> processBatch -> Decode -> llama_decode -> llama_decode_internal -> llama_output_reserve -> ggml_backend_buffer_clear -> ggml_backend_cpu_buffer_clear -> RC: memset

The last steps for each input happen in parallel and act on the same buffer: the buffer being used to encode the json response for input 1 is the same one memset is zeroing out for input 2. As a result, the memset that clears the buffer in preparation for input 2 wipes out the embedding that is about to be returned for input 1.
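
One possible mitigation -- a sketch only, not necessarily how the actual fix should look -- is to copy the embedding out of the backend-owned buffer before signaling the waiting handler, so a later llama_output_reserve/memset for the next batch cannot touch the data the handler is still encoding. The function below is illustrative; the names are hypothetical, not the exact ones in runner.go:

```go
package main

// forwardEmbedding detaches the embedding from the backend-owned buffer
// before the waiting handler can observe it, so a later decode/memset on
// that buffer cannot corrupt the response. Names are illustrative only.
func forwardEmbedding(backendBuf []float32, out chan<- []float32) {
	// backendBuf aliases memory the backend will clear and reuse for the
	// next batch; copy it into memory owned by this sequence.
	own := make([]float32, len(backendBuf))
	copy(own, backendBuf)

	// Hand the private copy to the handler that encodes the json response.
	out <- own
}

func main() {
	buf := []float32{0.1, 0.2, 0.3}
	ch := make(chan []float32, 1)
	forwardEmbedding(buf, ch)
	<-ch
}
```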

OS

Linux

GPU

No response

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-22 11:50:19 -05:00
Author
Owner

@dpereira commented on GitHub (Jan 30, 2025):

These files illustrate the response when:

  • The issue does not manifest (ok.json)
  • The issue manifests (rc.json)

ok.json: https://github.com/user-attachments/files/18609612/ok.json
rc.json: https://github.com/user-attachments/files/18609611/rc.json

Reference: github-starred/ollama#31409