[PR #5657] [CLOSED] Parallelize Tokenization in api/embed #58569

Closed
opened 2026-04-29 13:27:28 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/5657
Author: @royjhan
Created: 7/12/2024
Status: Closed

Base: main ← Head: royh-embed-parallel


📝 Commits (10+)

c22d548 Initial Batch Embedding
0f87628 Revert "Initial Batch Embedding"
ff191d7 Initial Draft
22458c5 mock up notes
c406fa7 api/embed draft
49e3411 add server function
b9c74df check normalization
5213c12 clean up
c111d8b normalization
80c1a3f playing around with truncate stuff

📊 Changes

8 files changed (+486 additions, -32 deletions)


📝 api/client.go (+10 -1)
📝 api/types.go (+24 -0)
➕ integration/embed_test.go (+152 -0)
📝 llm/ext_server/server.cpp (+23 -16)
📝 llm/server.go (+8 -8)
📝 server/routes.go (+162 -3)
📝 server/routes_test.go (+103 -0)
📝 server/sched_test.go (+4 -4)

📄 Description

Example: batch embedding of 250 inputs of 2,500 tokens each with nomic-embed-text.
Numbers: tokenizing + detokenizing one input takes 1.3 ms. Comparatively, embedding the 250-input batch takes 21.76 s of the 22.64 s total. The clear bottleneck is the embedding, not the tokenizing: even done serially, tokenizing all 250 inputs costs roughly 250 × 1.3 ms ≈ 0.33 s, under 1.5% of the total runtime.

TL;DR: parallelizing tokenization has no benefit for currently relevant workloads.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 13:27:28 -05:00
Reference: github-starred/ollama#58569