[GH-ISSUE #6911] Mixture of Agents for Ollama #4372

Closed
opened 2026-04-12 15:18:38 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @secondtruth on GitHub (Sep 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6911

The [Mixture of Agents (MoA)](https://arxiv.org/abs/2406.04692) is an innovative approach to leveraging the collective strengths of multiple language models to enhance the overall performance and capabilities of one main model (the aggregator). By combining outputs from various models, each potentially excelling in different aspects or domains, this approach has demonstrated significant improvements in model performance, even outperforming GPT-4o on certain benchmarks.

![moa-structure](https://github.com/user-attachments/assets/f18b22b6-a47d-45bb-9016-e77b106a752e)

This proposal aims to bring similar capabilities to Ollama, allowing users to define and utilize multiple agent models within a single inference pipeline. It suggests extending Ollama's Modelfile vocabulary to support a MoA architecture.

The feature would enable users to:

  1. Define multiple "reference" models, each potentially specialized for different tasks or domains.
  2. Specify an "aggregator" model that synthesizes the outputs from these referenced models.
  3. Create sophisticated inference pipelines that can adapt to various types of queries or tasks.
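The three steps above can be sketched end to end. This is a minimal illustration of the MoA flow with plain functions standing in for real model calls (all names here are hypothetical, not Ollama code): each reference model answers the query independently, and the aggregator synthesizes their outputs.

```python
# Minimal MoA flow sketch: reference agents answer independently,
# then the aggregator synthesizes their outputs into one reply.
# The "models" are stub functions standing in for real LLM calls.

def moa_answer(query, references, aggregator):
    """references: dict alias -> callable(query) -> str
       aggregator: callable(query, dict alias -> str) -> str"""
    ref_outputs = {alias: model(query) for alias, model in references.items()}
    return aggregator(query, ref_outputs)

# Stub agents standing in for e.g. qwen2.5 and llava
refs = {
    "companion": lambda q: f"[friendly take on: {q}]",
    "sight": lambda q: f"[visual take on: {q}]",
}

def aggregator(query, outputs):
    # Combine every agent's contribution into a single response
    parts = [f"{alias}: {text}" for alias, text in sorted(outputs.items())]
    return f"Synthesized answer to '{query}' from " + "; ".join(parts)

print(moa_answer("What is in this image?", refs, aggregator))
```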

Expanding Ollama's capabilities with MoA would make Ollama more flexible and powerful, and offer enhanced model performance. This feature would be especially beneficial in scenarios where diverse expertise or multimodal processing is required.

Proposed Syntax Extensions

  1. REFERENCE: Defines a reference model to be used as an agent. These directives must be placed before FROM so that the aliases can be referenced in the prompt templates of the aggregator model.

    REFERENCE <model_name> AS <expert_alias>
    
    • All configurations following this statement until the next REFERENCE or FROM apply to this agent model.
  2. FROM: Defines the aggregator model, matching its current use in Modelfiles. Must come after any REFERENCE directive.

    • All configurations following this statement apply to the aggregator model.
  3. SYSTEM: Defines the system prompt of the reference or aggregator model, matching its current use in Modelfiles.

  4. TEMPLATE --type=user-message: Defines a template for user messages given to the reference or aggregator model.

    • This proposal introduces new template variables for this case to access agent outputs.
    • Available template variables for TEMPLATE --type=user-message:
      • {{ .Ref.<expert_alias> }}

Example Usage

# Define first reference model
REFERENCE qwen2.5 AS companion

PARAMETER temperature 0.3
PARAMETER num_predict 512

SYSTEM """
You are a companion AI expert. Provide friendly and supportive responses.
"""

# Define second reference model (multimodal)
REFERENCE llava AS sight

PARAMETER temperature 0.4
PARAMETER num_predict 256

SYSTEM """
You are a visual analysis expert. Describe and analyze visual elements.
"""

# Aggregator model configuration
FROM llama3.1

PARAMETER temperature 0.5
PARAMETER num_ctx 4096

TEMPLATE --type=user-message """
You have been provided with a set of responses from various agent LLMs to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.

Responses from agents:

Companion:
{{ .Ref.companion }}

Sight:
{{ .Ref.sight }}
"""

Benefits

  1. Enhanced Model Capabilities: Leverage strengths of multiple models for more comprehensive responses.
  2. Flexibility: Easy configuration and deployment of complex model ensembles through declarative syntax.
  3. Improved Context-Awareness: Utilize specialized models for different aspects of queries.
  4. Familiar Syntax: Builds upon existing Modelfile conventions, making it intuitive for Ollama and Docker users.

Potential Use Cases

  1. Multimodal Processing: Combine text and image/video/voice analysis for richer understanding.
  2. Enhanced Question-Answering, Fact-Checking, and Research: Use specialized experts for different knowledge domains, aggregating information from multiple specialized sources.
  3. Context-Aware Conversational Agents: Dynamically adapt responses based on conversational context, or use different experts for style, content, and personality aspects.

See also

  • [Blog Post about Together MoA](https://www.together.ai/blog/together-moa)
  • [Together MoA Documentation](https://docs.together.ai/docs/mixture-of-agents)
  • [Together MoA GitHub Repo](https://github.com/togethercomputer/MoA)
GiteaMirror added the feature request label 2026-04-12 15:18:38 -05:00

@rick-github commented on GitHub (Sep 23, 2024):

Why not just use one of the many MoE/MoA frameworks (e.g. [ollama_moe](https://github.com/rapidarchitect/ollama_moe)) and plug in ollama as an inference engine?


@zwilch commented on GitHub (Sep 23, 2024):

Implementing [the functionality described] is better suited to a framework/frontend implementation than to the ollama backend. These frameworks would direct their requests to ollama as the backend, which would load the corresponding models with the parameters shown in the example and then summarize the responses. This doesn't fit into the ollama backend because it would mean anchoring a new API with a multitude of possible parameters within ollama, which the frontends/frameworks would then have to support. Therefore, this is better implemented in those frameworks and not in the LLM backend.
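The client-side approach both commenters describe can already be sketched against Ollama's existing `/api/chat` endpoint. Below is a minimal, hedged sketch: the endpoint URL and payload shape follow Ollama's REST API, the model names and system prompts mirror the Modelfile example above, and the helper names (`chat`, `build_aggregator_prompt`, `moa`) are illustrative, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

def chat(model, system, prompt):
    """One non-streaming chat call to a local Ollama server."""
    payload = json.dumps({
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    }).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

def build_aggregator_prompt(query, ref_outputs):
    """Pure helper: format the aggregation prompt from agent outputs."""
    sections = "\n\n".join(f"{alias}:\n{text}"
                           for alias, text in ref_outputs.items())
    return ("Synthesize the agent responses below into one high-quality "
            f"answer to the query.\n\nQuery: {query}\n\n"
            f"Responses from agents:\n\n{sections}")

def moa(query):
    # Reference agents; model names mirror the Modelfile example above
    refs = {
        "companion": chat("qwen2.5", "You are a companion AI expert.", query),
        "sight": chat("llava", "You are a visual analysis expert.", query),
    }
    return chat("llama3.1", "You are an aggregator.",
                build_aggregator_prompt(query, refs))
```

This illustrates the commenters' point: no backend changes are needed; the orchestration lives entirely in the client.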


Reference: github-starred/ollama#4372