[PR #10786] Ollama Parallel Cluster Support #44612

Open
opened 2026-04-25 00:12:44 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/10786
Author: @dakotacookson
Created: 5/20/2025
Status: 🔄 Open

Base: main ← Head: ollama-cluster-Enhancments


📝 Commits (9)

  • 1d95792 Added CLuster Capablities
  • f4d9e22 cluster working
  • 49e2fc5 fixed and enhanced discovery and error logging
  • c73348a fixed helath check issues
  • c26fde7 fixed missing functinality
  • 0293abf fixed build error
  • a3c73c3 Update ggml-hip.dll
  • e47c167 added more fixes and removed mock routes added openapi compatiablity
  • 767e519 changed to streaming mode and fixed issue wiht http request

📊 Changes

71 files changed (+25206 additions, -24 deletions)

View changed files

📝 Dockerfile (+7 -5)
api/cluster_types.go (+316 -0)
build_rocm/lib/ollama/rocm/ggml-hip-wrapper.cpp (+8 -0)
build_rocm/lib/ollama/rocm/ggml-hip.dll (+0 -0)
build_rocm/lib/ollama/rocm/ggml-hip.exp (+0 -0)
build_rocm/lib/ollama/rocm/ggml-hip.lib (+0 -0)
cluster/config.go (+402 -0)
cluster/config_test.go (+220 -0)
cluster/discovery.go (+1338 -0)
cluster/discovery_test.go (+355 -0)
cluster/errors/README.md (+195 -0)
cluster/errors/communication.go (+496 -0)
cluster/errors/context.go (+158 -0)
cluster/errors/degradation.go (+420 -0)
cluster/errors/errors.go (+129 -0)
cluster/errors/failures.go (+106 -0)
cluster/errors/metrics.go (+433 -0)
cluster/errors/network.go (+27 -0)
cluster/errors/reconnect.go (+326 -0)
cluster/errors/retry.go (+106 -0)

...and 51 more files

📄 Description

A note: I'm new to all of this, but I needed this for my own usage. I decided to push up the branch in case anybody else wanted to review it and fix it up if needed. I will probably not fix any bugs myself; I'm just pushing this up for others, or in case it's useful for the rest of you too.

Distributed Clustering Support for Ollama

This PR introduces comprehensive clustering capabilities to Ollama, enabling horizontal scaling across multiple machines for increased throughput, reliability, and resource utilization.

Core Features

🌐 Cluster Architecture

  • Multi-node deployment with coordinator and worker role separation
  • Flexible topology supporting hub-and-spoke, mesh, and hybrid configurations
  • State machine-based node lifecycle with clean transitions between operational states
  • Resource-aware admission control for optimized workload distribution
  • Centralized configuration with per-node overrides for fine-grained control
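
The centralized-configuration idea can be sketched as a base cluster config that individual nodes override field by field. This is an illustrative sketch only; the type and field names are assumptions, not the PR's actual `cluster/config.go` definitions:

```go
package main

import "fmt"

// ClusterConfig is a hypothetical config type; fields are illustrative.
type ClusterConfig struct {
	HeartbeatIntervalMS int
	MaxConcurrentJobs   int
}

// Merge returns a copy of the base config with any non-zero
// override fields applied on top of it.
func (c ClusterConfig) Merge(override ClusterConfig) ClusterConfig {
	out := c
	if override.HeartbeatIntervalMS != 0 {
		out.HeartbeatIntervalMS = override.HeartbeatIntervalMS
	}
	if override.MaxConcurrentJobs != 0 {
		out.MaxConcurrentJobs = override.MaxConcurrentJobs
	}
	return out
}

func main() {
	base := ClusterConfig{HeartbeatIntervalMS: 1000, MaxConcurrentJobs: 4}
	// One node keeps the cluster heartbeat but allows more jobs.
	node := base.Merge(ClusterConfig{MaxConcurrentJobs: 8})
	fmt.Println(node.HeartbeatIntervalMS, node.MaxConcurrentJobs) // 1000 8
}
```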

🔍 Smart Node Discovery

  • Zero-configuration mDNS discovery for automatic node detection on local networks
  • Multicast UDP heartbeats for lightweight presence detection
  • Manual configuration support for complex network environments
  • Service advertisement with metadata for capability-based routing
  • Dynamic node registration with IP-based identity verification

📊 Resource Tracking

  • Comprehensive node monitoring with detailed resource metrics (CPU, RAM, GPU)
  • Workload-aware scheduling to prevent hotspots and overloaded nodes
  • Hardware capability detection for heterogeneous cluster management
  • Network bandwidth tracking to optimize inter-node communication
  • Resource reservation mechanism for prioritizing critical workloads
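
Workload-aware scheduling of this kind typically reduces to picking the least-loaded node that still satisfies a hardware constraint. A hypothetical sketch (the struct and its fields are assumptions, not the PR's types):

```go
package main

import "fmt"

// NodeResources is an illustrative per-node resource snapshot.
type NodeResources struct {
	ID         string
	CPUPercent float64
	FreeVRAMMB int
}

// leastLoaded picks the node with the lowest CPU utilization among
// nodes that have at least minVRAM MB of free GPU memory.
func leastLoaded(nodes []NodeResources, minVRAM int) (string, bool) {
	best := ""
	bestCPU := 101.0 // above any valid percentage
	for _, n := range nodes {
		if n.FreeVRAMMB >= minVRAM && n.CPUPercent < bestCPU {
			best, bestCPU = n.ID, n.CPUPercent
		}
	}
	return best, best != ""
}

func main() {
	nodes := []NodeResources{
		{"a", 20, 4096},
		{"b", 5, 1024}, // idle, but lacks VRAM for the model
		{"c", 60, 8192},
	}
	id, _ := leastLoaded(nodes, 2048)
	fmt.Println(id) // a
}
```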

⚕️ Health Management

  • Real-time health monitoring with configurable thresholds and checks
  • Automatic failure detection with node status degradation
  • Self-healing reconnection with exponential backoff and jitter
  • Graceful degradation during partial outages
  • Intelligent timeout handling based on historical latency patterns
  • Enhanced error categorization for targeted recovery strategies

🔄 Communication System

  • Robust error handling with detailed error categorization and context
  • Retry mechanisms with configurable backoff policies
  • Latency tracking with automatic timeout adjustments
  • Graceful connection management to prevent resource leaks
  • Communication metrics for performance optimization

🚦 Status Management

  • State machine for valid status transitions (Online, Busy, Degraded, Offline, etc.)
  • Event-driven architecture for real-time status updates
  • Consistency preservation during state changes
  • Deadlock prevention with timeout-based recovery
  • Detailed status history for debugging and analysis
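
A status state machine like the one described can be expressed as a transition table consulted before any status change. The statuses and allowed edges below are illustrative; the PR's actual set may differ:

```go
package main

import "fmt"

type Status string

const (
	Online   Status = "online"
	Busy     Status = "busy"
	Degraded Status = "degraded"
	Offline  Status = "offline"
)

// validTransitions maps each status to the statuses it may move to.
// Note that Offline can only recover via Online, never jump to Busy.
var validTransitions = map[Status][]Status{
	Online:   {Busy, Degraded, Offline},
	Busy:     {Online, Degraded, Offline},
	Degraded: {Online, Offline},
	Offline:  {Online},
}

// CanTransition reports whether moving from -> to is allowed.
func CanTransition(from, to Status) bool {
	for _, s := range validTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Offline, Busy))    // false: must come online first
	fmt.Println(CanTransition(Degraded, Online)) // true: recovery path
}
```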

Implementation Details

The clustering system is built on several modular components that work together:

Core Components

ClusterMode

The central controller that initializes and manages all cluster components. It handles node state transitions, provides access to shared services, and coordinates startup/shutdown sequences.

NodeRegistry

Maintains the authoritative list of all cluster nodes with their current state. It provides thread-safe access to node information, emits events on node state changes, and supports filtering by node role and status.
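
A registry along these lines can be sketched with an `RWMutex`-guarded map plus role/status filtering. Types and method names here are assumptions, not the PR's API:

```go
package main

import (
	"fmt"
	"sync"
)

type Node struct {
	ID     string
	Role   string // e.g. "coordinator" or "worker"
	Status string
}

// NodeRegistry provides thread-safe access to the node list: writes
// take the write lock, reads share the read lock.
type NodeRegistry struct {
	mu    sync.RWMutex
	nodes map[string]Node
}

func NewNodeRegistry() *NodeRegistry {
	return &NodeRegistry{nodes: make(map[string]Node)}
}

// Upsert adds or replaces a node entry.
func (r *NodeRegistry) Upsert(n Node) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nodes[n.ID] = n
}

// Filter returns nodes matching role and status; "" matches any.
func (r *NodeRegistry) Filter(role, status string) []Node {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []Node
	for _, n := range r.nodes {
		if (role == "" || n.Role == role) && (status == "" || n.Status == status) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	reg := NewNodeRegistry()
	reg.Upsert(Node{"a", "worker", "online"})
	reg.Upsert(Node{"b", "worker", "offline"})
	reg.Upsert(Node{"c", "coordinator", "online"})
	fmt.Println(len(reg.Filter("worker", "online"))) // 1
}
```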

DiscoveryService

Handles node discovery and heartbeats using configurable methods (mDNS, multicast, or manual). It includes automatic reconnection logic, enhanced error handling, and protocol versioning for backward compatibility.

HealthMonitor

Performs continuous health checks on all nodes with adaptive monitoring frequency. It detects issues early through metrics analysis and manages a degradation pipeline for graceful service reduction during partial outages.

Extended Features

Error Management

The errors package provides detailed error tracking with severity levels, categorization, and contextual metadata. It implements specialized error types for different failure scenarios and includes recovery suggestion mechanisms.

Retry Policies

Configurable retry strategies with exponential backoff, jitter, and circuit-breaking capabilities to prevent cascading failures during network issues or service degradation.

Status Controller

A robust state machine that enforces valid node status transitions, preventing inconsistent states while allowing flexibility for recovery operations.

Performance Considerations

The clustering implementation is designed for minimal overhead:

  • Lightweight heartbeats (<100 bytes) with configurable frequency
  • Efficient service discovery with connection pooling and caching
  • Backpressure mechanisms to prevent resource exhaustion
  • Selective health checks based on node criticality and history
  • Background processing for non-critical operations

Testing Framework

The PR includes comprehensive testing capabilities:

  • Integration tests for multi-node scenarios
  • Failure injection for resilience testing
  • Network partitioning simulation
  • Long-running stability tests
  • Performance benchmarks for overhead measurement

Future Enhancements

While this PR provides a robust clustering foundation, future enhancements could include:

  • Distributed model sharding across nodes
  • Cross-node tensor parallelism
  • Predictive scaling based on usage patterns
  • Fine-grained access control for node management
  • Enhanced visualization for cluster health

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 00:12:44 -05:00

Reference: github-starred/ollama#44612