[GH-ISSUE #14314] Embeddings getting slower and slower #9315

Closed
opened 2026-04-12 22:10:45 -05:00 by GiteaMirror · 5 comments

Originally created by @ivoras on GitHub (Feb 18, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14314

What is the issue?

Calling the /api/embed API results in each new request being slower than the last.
See the attached log - this is CPU-only.
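
A loop along these lines shows the behavior (host, model name, and input below are placeholders, not taken from this report):

# time consecutive /api/embed calls against a local Ollama instance
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{time_total}\n' http://localhost:11434/api/embed \
    -d "{\"model\": \"MODEL\", \"input\": \"request $i\"}"
done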

Relevant log output

Feb 19 00:36:18 xx ollama[2619070]: [GIN] 2026/02/19 - 00:36:18 | 200 |  1.590016255s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:36:18 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:36:19 xx ollama[2619070]: [GIN] 2026/02/19 - 00:36:19 | 200 |  2.084325676s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:36:19 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:36:21 xx ollama[2619070]: [GIN] 2026/02/19 - 00:36:21 | 200 |  2.902344855s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:36:21 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:36:21 xx ollama[2619070]: output_reserve: reallocating output buffer from size 10.52 MiB to 200.89 MiB
Feb 19 00:36:31 xx ollama[2619070]: [GIN] 2026/02/19 - 00:36:31 | 200 | 13.060263867s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:36:31 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:36:31 xx ollama[2619070]: output_reserve: reallocating output buffer from size 200.89 MiB to 269.76 MiB
Feb 19 00:36:46 xx ollama[2619070]: [GIN] 2026/02/19 - 00:36:46 | 200 |  27.71817867s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:36:46 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:36:55 xx ollama[2619070]: [GIN] 2026/02/19 - 00:36:55 | 200 | 36.775755535s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:36:55 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:37:03 xx ollama[2619070]: [GIN] 2026/02/19 - 00:37:03 | 200 | 44.281857556s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:37:03 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:37:07 xx ollama[2619070]: [GIN] 2026/02/19 - 00:37:07 | 200 | 49.177545982s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:37:07 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:37:12 xx ollama[2619070]: [GIN] 2026/02/19 - 00:37:12 | 200 | 53.164910262s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:37:12 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:37:19 xx ollama[2619070]: [GIN] 2026/02/19 - 00:37:19 | 200 |  59.96701725s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:37:19 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:37:28 xx ollama[2619070]: [GIN] 2026/02/19 - 00:37:28 | 200 |          1m8s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:37:28 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:37:33 xx ollama[2619070]: [GIN] 2026/02/19 - 00:37:33 | 200 |         1m14s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:37:33 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:37:44 xx ollama[2619070]: [GIN] 2026/02/19 - 00:37:44 | 200 |         1m24s |      172.17.0.2 | POST     "/api/embed"
Feb 19 00:37:44 xx ollama[2619070]: init: embeddings required but some input tokens were not marked as outputs -> overriding
Feb 19 00:37:48 xx ollama[2619070]: [GIN] 2026/02/19 - 00:37:48 | 200 |         1m28s |      172.17.0.2 | POST     "/api/embed"

OS

Linux

GPU

No response

CPU

Intel

Ollama version

0.16.2

GiteaMirror added the needs more info and bug labels 2026-04-12 22:10:45 -05:00

@rick-github commented on GitHub (Feb 19, 2026):

Model? Example inputs?


@ivoras commented on GitHub (Feb 19, 2026):

Seen on two models:

  • granite-embedding:278m
  • nomic-embed-text-v2-moe:latest

@rick-github commented on GitHub (Feb 19, 2026):

What client are you using?

$ for i in {1..500} ; do /usr/bin/time -f %e ollama run granite-embedding:278m $RANDOM hello 2>&1 >/dev/null ; done | datamash min 1 max 1 mean 1 sstdev 1
0.49    0.55    0.5227  0.0090262365297938

@rick-github commented on GitHub (Feb 19, 2026):

@darshjme-codes Please do not spam these issues with AI generated slop. As my testing shows, this is not as simple as "catastrophic memory reallocation cascading".


@NAPTiON commented on GitHub (Feb 23, 2026):

If you're using embeddings for agent memory/RAG and hitting performance degradation at scale, one alternative worth considering: skip embeddings entirely and use the LLM as the retrieval layer.

I run a memory pipeline for an AI agent that stores everything in plain markdown files, categorized by a local Llama 3.2 1B model. On each session start, the agent reads a keyword index (~500 lines) and loads only the relevant memory files into context.

No embeddings. No vector DB. No performance degradation over time. The tradeoff is it doesn't scale past ~5000 memories (context window limit), but for most agent use cases that's plenty.

Pipeline latency is ~200ms per cycle on M-series Mac. The categorization is the expensive part, but it runs at write time (every 10 min), not at query time.

Scripts: NAPTiON/ai-memory-pipeline (https://github.com/NAPTiON/ai-memory-pipeline)
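
The retrieval step is roughly this, assuming a flat keyword-to-filename index format (the actual layout in the repo may differ):

# load only the memory files whose index keywords match the query
query="$1"
grep -i -- "$query" memory/index.txt \
  | awk -F': ' '{print $2}' \
  | sort -u \
  | while read -r f; do cat "memory/$f"; done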
