[GH-ISSUE #12591] Best practices for concurrent embeddings in multi-node deployments #8355

Closed
opened 2026-04-12 20:57:21 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @B-A-M-N on GitHub (Oct 13, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12591

Description

I've developed SOLLOL, an orchestration and observability layer for distributed Ollama/llama.cpp deployments. After achieving 19-21 embeddings/sec throughput across multi-node clusters, I have questions about optimizing connection patterns and understanding Ollama's internal behavior for distributed workloads.

Context

Project: Production-grade connection pooling and load balancing for multi-node Ollama setups
Use case: Document embedding pipelines (FlockParser) and multi-agent systems (SynapticLlamas)
Current setup: 2-3 mixed CPU/GPU nodes on local network

Current Implementation

# HTTP/2 with connection reuse (HTTP/2 support requires the optional h2 dependency: pip install "httpx[http2]")
import httpx

limits = httpx.Limits(
    max_keepalive_connections=40,
    max_connections=100,
    keepalive_expiry=30.0
)
session = httpx.Client(
    # When a custom transport is supplied, pool limits must be set on the
    # transport itself; Client(limits=...) is ignored in that case.
    transport=httpx.HTTPTransport(retries=3, http2=True, limits=limits),
    timeout=httpx.Timeout(300.0, connect=10.0)
)

# Adaptive batching
- Small batches (<100 texts): ThreadPoolExecutor with 2 workers per node
- Large batches (>100 texts): Dask distributed processing
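
For illustration, here is a minimal sketch of this size-based dispatch, assuming a hypothetical embed_one(node, text) helper and an already-running dask.distributed cluster; the threshold and worker counts mirror the description above, and none of this is SOLLOL's actual code:

# Illustrative sketch only (not SOLLOL's actual code): dispatch embedding
# batches by size, mirroring the adaptive-batching rules above.
from concurrent.futures import ThreadPoolExecutor

BATCH_THRESHOLD = 100  # texts, per the split described above

def embed_batch(texts, nodes, embed_one, dask_client=None):
    # embed_one(node, text) is a hypothetical helper that calls one node's
    # embedding endpoint and returns a vector.
    if dask_client is None or len(texts) < BATCH_THRESHOLD:
        # Small batches: local thread pool, ~2 workers per node.
        with ThreadPoolExecutor(max_workers=2 * len(nodes)) as pool:
            futures = [
                pool.submit(embed_one, nodes[i % len(nodes)], text)
                for i, text in enumerate(texts)
            ]
            return [f.result() for f in futures]
    # Large batches: fan out through an existing dask.distributed Client.
    futures = [
        dask_client.submit(embed_one, nodes[i % len(nodes)], text)
        for i, text in enumerate(texts)
    ]
    return dask_client.gather(futures)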

Performance Results

Test setup: 2 Ollama nodes, mxbai-embed-large model

Batch Size    Strategy              Throughput     Notes
25 texts      ThreadPoolExecutor    ~19 emb/sec    baseline
50 texts      ThreadPoolExecutor    ~21 emb/sec    baseline
100 texts     ThreadPoolExecutor    ~21 emb/sec    baseline
200 texts     Dask distributed      ~21 emb/sec    1.46x faster than ThreadPool
300 texts     Dask distributed      ~21 emb/sec    scales linearly

Key optimizations:

  • HTTP/2 multiplexing: ~30% latency reduction on concurrent requests
  • Connection reuse: 10x speedup vs naive implementation (a sketch contrasting the two patterns follows this list)
  • Worker-local pool caching: eliminates Dask serialization overhead
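
To make the connection-reuse comparison concrete, here is a minimal sketch of the naive pattern (a new client, and therefore a new TCP handshake, per request) versus a pooled client, against Ollama's /api/embed endpoint; the node address and model name are placeholders:

import httpx

NODE = "http://10.9.66.48:11434"   # placeholder node address
MODEL = "mxbai-embed-large"

def embed_naive(texts):
    # Naive: a fresh client (and fresh connection) for every request.
    vectors = []
    for text in texts:
        with httpx.Client() as client:
            r = client.post(f"{NODE}/api/embed", json={"model": MODEL, "input": text})
            r.raise_for_status()
            vectors.append(r.json()["embeddings"][0])
    return vectors

def embed_pooled(texts, client: httpx.Client):
    # Pooled: one keep-alive client reused across all requests.
    vectors = []
    for text in texts:
        r = client.post(f"{NODE}/api/embed", json={"model": MODEL, "input": text})
        r.raise_for_status()
        vectors.append(r.json()["embeddings"][0])
    return vectors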

Questions

  1. Connection pooling: Does ollama serve benefit from HTTP/2 multiplexing, or is HTTP/1.1 with keep-alive equally effective? Are there any connection-level optimizations we should be aware of?

  2. Concurrency limits: What's the recommended maximum concurrent requests per Ollama instance? Are there internal queues or throttling mechanisms we should tune for?

  3. Request batching: Does Ollama perform any internal batching of embedding requests? Understanding this would help optimize our client-side batching strategy.

  4. Connection lifecycle: Would maintaining persistent/long-lived connections provide benefits beyond keep-alive headers? Do connections maintain any state between requests?

  5. Async API plans: Are there plans for native async/streaming embedding APIs? This would allow more efficient non-blocking I/O patterns.
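
On question 5: even without a server-side async API, non-blocking client patterns are possible today with httpx.AsyncClient against the existing endpoint. A rough sketch (same placeholder node and model as above; the concurrency cap is an arbitrary choice, not a recommended value):

import asyncio
import httpx

async def embed_async(texts, base_url="http://10.9.66.48:11434",
                      model="mxbai-embed-large", max_in_flight=8):
    # Client-side non-blocking I/O against the existing HTTP API; the
    # semaphore caps in-flight requests per node (arbitrary value).
    sem = asyncio.Semaphore(max_in_flight)
    async with httpx.AsyncClient(base_url=base_url, timeout=300.0) as client:
        async def embed_one(text):
            async with sem:
                r = await client.post("/api/embed", json={"model": model, "input": text})
                r.raise_for_status()
                return r.json()["embeddings"][0]
        return await asyncio.gather(*(embed_one(t) for t in texts))

# Usage: vectors = asyncio.run(embed_async(["hello", "world"]))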

Why This Matters

Many teams are running multi-node Ollama clusters (home labs, small businesses, research environments) but lack tooling for unified orchestration. SOLLOL aims to make distributed inference as simple as single-node deployments through:

  • Zero-config node discovery
  • Intelligent load balancing with VRAM awareness
  • Real-time observability and metrics
  • Adaptive routing strategies

Understanding Ollama's connection behavior and internal architecture would help optimize distributed client implementations.


Links

  • SOLLOL: https://github.com/BenevolentJoker-JohnL/SOLLOL
  • FlockParser (document processing): https://github.com/BenevolentJoker-JohnL/FlockParser
  • SynapticLlamas (multi-agent): https://github.com/BenevolentJoker-JohnL/SynapticLlamas

Any insights from the Ollama team on optimizing distributed deployments would be greatly appreciated. Happy to provide more details or testing data if helpful.

@B-A-M-N commented on GitHub (Oct 13, 2025):

Dashboard Screenshots

Here's the SOLLOL unified dashboard in action:

Node Health Monitoring:

  • 2 active Ollama nodes (10.9.66.48, 10.9.66.154)
  • 11-12ms latency, 100% success rate
  • Real-time status tracking

Active Applications:

  • FlockParser (document processing pipeline)
  • SynapticLlamas (multi-agent reasoning)
  • Both connected via OllamaPool router

Real-time Routing Decisions:

  • Activity logs showing mxbai-embed-large requests/responses
  • Latency tracking per request (163ms-2.3sec)
  • llama.cpp activity stream connected
  • Integrated Ray + Dask dashboards

Dask Distributed Processing:

  • Task distribution across cluster workers
  • Bytes stored: 299 MiB
  • Processing + CPU + Data Transfer phases visible
  • Task stream showing parallel execution

The adaptive routing automatically handles:

  • Small batches (<100 texts) → Local ThreadPoolExecutor (lower overhead)
  • Large batches (>100 texts) → Dask distributed (better parallelism)

Key insight: Connection pooling + HTTP/2 is where most of the speedup comes from. Dask adds another 1.4-2× on top for large batches.


@B-A-M-N commented on GitHub (Oct 13, 2025):

Dashboard Screenshots

1. Node Health Monitoring

[Screenshot: Node Health Monitoring (https://i.imgur.com/O0cAyLj.png)]

  • 2 active Ollama nodes (10.9.66.48, 10.9.66.154)
  • 11-12ms latency, 100% success rate
  • Real-time status tracking

2. Active Applications

[Screenshot: Active Applications (https://i.imgur.com/CxbCbgB.png)]

  • FlockParser (document processing pipeline)
  • SynapticLlamas (multi-agent reasoning)
  • Both connected via OllamaPool router

3. Real-time Routing Decisions

[Screenshot: Routing Decisions (https://i.imgur.com/BsyKmZ5.png)]

  • Activity logs showing mxbai-embed-large requests/responses
  • Latency tracking per request (163ms-2.3sec)
  • llama.cpp activity stream connected
  • Integrated Ray + Dask dashboards

4. Dask Distributed Processing

[Screenshot: Dask Processing (https://i.imgur.com/G9fgzw0.png)]

  • Task distribution across cluster workers
  • Bytes stored: 299 MiB
  • Processing + CPU + Data Transfer phases visible
  • Task stream showing parallel execution

The adaptive routing automatically handles:

  • Small batches (<100 texts) → Local ThreadPoolExecutor (lower overhead)
  • Large batches (>100 texts) → Dask distributed (better parallelism)

Key insight: Connection pooling + HTTP/2 is where most of the speedup comes from. Dask adds another 1.4-2× on top for large batches.


@rick-github commented on GitHub (Oct 13, 2025):

ollama does not currently support parallel embeddings (see the scheduler: https://github.com/ollama/ollama/blob/6544e1473525c381e89aba4778283900b3ad7145/server/sched.go#L399).

See the FAQ (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests) for documentation on configuring parallel completions.
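
For illustration only: if embeddings are processed one at a time per instance, one client-side approach is to serialize requests per node while still fanning out across all nodes. The sketch below is hypothetical (helper names invented) and is not from either project:

# Hypothetical sketch: serialize embedding requests per node (one in flight
# per instance) while still using every node in parallel.
import threading
from concurrent.futures import ThreadPoolExecutor

class PerNodeSerializer:
    def __init__(self, nodes, embed_one):
        # embed_one(node, text) is a hypothetical helper that calls a node's
        # embedding endpoint and returns a vector.
        self._nodes = list(nodes)
        self._embed_one = embed_one
        self._locks = {node: threading.Lock() for node in self._nodes}

    def embed_all(self, texts):
        # One worker per node; the per-node lock keeps each node's work serial.
        with ThreadPoolExecutor(max_workers=len(self._nodes)) as pool:
            futures = [
                pool.submit(self._embed_serial, self._nodes[i % len(self._nodes)], text)
                for i, text in enumerate(texts)
            ]
            return [f.result() for f in futures]

    def _embed_serial(self, node, text):
        with self._locks[node]:
            return self._embed_one(node, text)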


@B-A-M-N commented on GitHub (Oct 13, 2025):

Thanks for the clarification! That's actually what motivated building SOLLOL - the lack of native parallel embedding support and no way to leverage all the compute resources on a network.

SOLLOL is also the first hybrid routing system that can coordinate across both Ollama and llama.cpp RPC backends simultaneously, enabling model sharding across heterogeneous infrastructure.

A few follow-up questions about the parallel completions configuration:

  1. Does the parallel completions setting apply to embeddings? The docs mention completions, but I want to confirm if OLLAMA_NUM_PARALLEL or similar settings affect embedding throughput.

  2. Multi-node coordination: Ollama doesn't provide any native way to discover and utilize multiple Ollama instances across a network. Is this intentional (expecting users to handle orchestration), or are there plans for built-in clustering/coordination?

  3. Hybrid backend interop: For use cases requiring model sharding (large models split across nodes), is there any coordination between Ollama and llama.cpp RPC backends? Or is this expected to be handled at the orchestration layer?

  4. Connection handling: Since Ollama doesn't support parallel embeddings natively, is it safe to send concurrent embedding requests from multiple clients (as we're doing)? Or should we implement client-side queuing?

  5. Performance characteristics: With our current approach (concurrent requests via connection pooling), we're seeing:

    • 10× speedup from connection reuse alone
    • 30% additional improvement from HTTP/2 multiplexing
    • Linear scaling across nodes

    Are there any known bottlenecks or gotchas when doing high-concurrency embedding requests against a single Ollama instance?

The goal of SOLLOL is to provide the parallelization and orchestration layer that Ollama doesn't have built-in - essentially making distributed multi-node setups work transparently. If there's a better way to achieve this that aligns with Ollama's architecture, I'd love to hear suggestions!

Current architecture:

Client App → SOLLOL Hybrid Router → Ollama Nodes (embeddings/generation)
                                  → llama.cpp RPC Backends (model sharding)

This gives us:

  • Hybrid backend routing - First system to coordinate Ollama + llama.cpp RPC
  • Model sharding support - Split large models across RPC backends
  • Automatic node discovery across the network
  • Automatic failover
  • Intelligent routing based on node health/VRAM/backend type
  • Distributed batch processing via Dask
  • Unified observability for heterogeneous infrastructure

Essentially treating N Ollama instances + M llama.cpp RPC backends as a single logical service - making it possible to actually leverage all the GPUs/CPUs on your network, regardless of backend type.
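
As a rough sketch of what such single-logical-service routing could look like (the fields, backend kinds, and VRAM heuristic below are invented for illustration and are not SOLLOL's real API):

# Invented example: choose a backend by type, health, and free VRAM.
from dataclasses import dataclass

@dataclass
class Backend:
    url: str
    kind: str            # "ollama" or "llamacpp_rpc"
    free_vram_mb: int
    healthy: bool = True

def pick_backend(backends, model_size_mb):
    # Models too large for any single Ollama node go to the llama.cpp RPC
    # backends (sharding); everything else goes to the Ollama node with the
    # most free VRAM.
    ollama_nodes = [b for b in backends if b.kind == "ollama" and b.healthy]
    rpc_nodes = [b for b in backends if b.kind == "llamacpp_rpc" and b.healthy]
    largest_fit = max((b.free_vram_mb for b in ollama_nodes), default=0)
    candidates = rpc_nodes if model_size_mb > largest_fit else ollama_nodes
    if not candidates:
        raise RuntimeError("no healthy backend available")
    return max(candidates, key=lambda b: b.free_vram_mb)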


@rick-github commented on GitHub (Oct 13, 2025):

Yeah, I'm not responding to LLM slop. It seems like you are just shilling your project here and in the discord.

Reference: github-starred/ollama#8355