[GH-ISSUE #2109] Support loading multiple models at the same time #26968

Closed
opened 2026-04-22 03:46:26 -05:00 by GiteaMirror · 18 comments

Originally created by @Picaso2 on GitHub (Jan 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2109

Originally assigned to: @dhiltgen on GitHub.

Is it possible to create one model from multiple models? Or even load multiple models?


@Dbone29 commented on GitHub (Jan 22, 2024):

You can merge 2 models with another tool. https://huggingface.co/Undi95 does this with some models. After that you can create a GGUF file of the merged model and use it in Ollama whenever you want. Ollama on its own isn't able to combine 2 models.


@cmndcntrlcyber commented on GitHub (Mar 2, 2024):

> You can merge 2 models with another tool. https://huggingface.co/Undi95 does this with some models. After that you can create a GGUF file of the merged model and use it in Ollama whenever you want. Ollama on its own isn't able to combine 2 models.

Do you happen to have a link or name of the tool?


@Dbone29 commented on GitHub (Mar 2, 2024):

There are many tools for this task, but unfortunately, I am not familiar enough to say which one is the best or what the differences between them are. However, here's an example of a tool that I came across last year:

https://github.com/arcee-ai/mergekit
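For anyone wanting a concrete starting point, the rough flow is: merge with mergekit, convert the merged weights to GGUF, then import into Ollama. The sketch below is illustrative only; the config file name, output paths, and the llama.cpp conversion script name are assumptions to verify against the mergekit and llama.cpp docs for your versions.

```
# Merge two Hugging Face models, convert the result to GGUF, import into Ollama.
pip install mergekit

# merge-config.yml lists the source models and merge method
# (see the mergekit README for the exact schema).
mergekit-yaml merge-config.yml ./merged-model

# Convert the merged HF checkpoint to GGUF with llama.cpp's converter
# (the script name varies between llama.cpp versions).
python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile merged.gguf

# Register the GGUF with Ollama via a minimal Modelfile.
echo "FROM ./merged.gguf" > Modelfile
ollama create my-merged-model -f Modelfile
ollama run my-merged-model
```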


@pdevine commented on GitHub (Mar 11, 2024):

@Picaso2 other than the multimodal models we don't yet support loading multiple models into memory simultaneously. What is the use case you're trying to do?


@mofanke commented on GitHub (Mar 12, 2024):

> @Picaso2 other than the multimodal models we don't _yet_ support loading multiple models into memory simultaneously. What is the use case you're trying to do?

I encountered a similar requirement, and I want to implement a RAG (Retrieval-Augmented Generation) system. It requires using both an embedding model and a chat model separately. Currently, the implementation with Ollama requires constantly switching between models, which slows down the process. It would be much more efficient if there was a way to use them simultaneously.
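For context, the two calls such a RAG loop alternates between look roughly like the sketch below (the model names are just examples; whether both models stay resident at once depends on the Ollama version and its keep-alive settings). With only one model loaded at a time, Ollama swaps models between the two requests, which is the slowdown described here.

```
# 1. Embed a document chunk with an embedding model.
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Paris is the capital of France."
}'

# 2. Answer with a chat model, using the retrieved context in the prompt.
curl http://localhost:11434/api/chat -d '{
  "model": "mistral",
  "messages": [
    {"role": "user", "content": "Context: Paris is the capital of France. Question: where is the Eiffel Tower?"}
  ]
}'
```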


@Picaso2 commented on GitHub (Mar 12, 2024):

Ultimately I would like to have a system that I can have a conversation with on various topics, from science to politics to math.

@Fruetel commented on GitHub (Mar 16, 2024):

> @Picaso2 other than the multimodal models we don't _yet_ support loading multiple models into memory simultaneously. What is the use case you're trying to do?

I also have a use case for this.

I'm using Crew.ai with Ollama. I have agents that need to use tools, such as search or document retrieval, and other agents that work on the data provided by the tool-using agents. For the tool-using agents I use Hermes-2-Pro-Mistral, which is optimized for tool usage but, at 7 billion parameters, not that smart. It would be awesome to be able to load a smarter Mixtral model for the thinking agents in parallel with Hermes for the tool-using ones.


@dizzyriver commented on GitHub (Mar 16, 2024):

Same. I'd like llava for image to text and mixtral for language reasoning


@alfi4000 commented on GitHub (Mar 19, 2024):

> Same. I'd like llava for image to text and mixtral for language reasoning

Same.


@alfi4000 commented on GitHub (Mar 19, 2024):

Would it be possible to run several models at once, with one on the GPU and the others on CPU and RAM? I want to be able to run several models at the same time, so that if one of my family members is using Ollama through Open WebUI at the same time as me, one request runs on the CPU and the other on the GPU.


@oldgithubman commented on GitHub (Mar 19, 2024):

I have a rig with three graphics cards that I would like to run three separate models on simultaneously and have them group chat


@alfi4000 commented on GitHub (Mar 23, 2024):

> I have a rig with three graphics cards that I would like to run three separate models on simultaneously and have them group chat

Try running this, editing the IP address 127.0.0.1 to your rig's IP address (or just leaving it as it is), and then add each IP address and port as a separate connection in your front end. For example, I used Open WebUI, added those 3 connections, and it managed one connection per chat window, so it should drive all 3 graphics cards, but only if you run it like this:

Linux (I tested on Ubuntu):

`OLLAMA_HOST=127.0.0.1:11435 ollama serve & OLLAMA_HOST=127.0.0.1:11436 ollama serve & OLLAMA_HOST=127.0.0.1:11437 ollama serve`

This is just an example; you can use different ports, but each connection drives only one GPU and one LLM, not several, otherwise the requests are handled one after another (first, then second, then third).

To stop it (Linux, tested on Ubuntu):

Command: `pgrep ollama`
Output:
1828
2883
1284
Command: `kill 1284 & kill 1828 & kill 2883`

If that doesn't work, try to kill each process manually:

`kill 1828`
`kill 2883`
`kill 1284`


@oldgithubman commented on GitHub (Mar 30, 2024):

That's what I'm currently doing (loosely), but you also have to map each instance to a specific GPU. It works, but it's very clunky to set up. A GUI would be nice.
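For what it's worth, the per-instance GPU mapping can be done with environment variables rather than a GUI; a minimal sketch for NVIDIA GPUs, assuming the same three-port scheme as the earlier comment, is to restrict each `ollama serve` process to a single device with `CUDA_VISIBLE_DEVICES`:

```
# Pin each Ollama instance to one GPU by limiting which device it can see.
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11436 ollama serve &
CUDA_VISIBLE_DEVICES=2 OLLAMA_HOST=127.0.0.1:11437 ollama serve &
```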


@leporel commented on GitHub (Apr 6, 2024):

Run them in Docker, pinning the containers separately to gpu1, gpu2, or CPU only; open-webui can work with multiple ollama instances:

```
version: '3.8'

services:
  # GPU-backed Ollama instance (pinned to NVIDIA device 0).
  ollama:
    volumes:
      - type: bind
        source: C:\MyPrograms\ollama\data
        target: /root/.ollama
      - type: bind
        source: C:\MyPrograms\ollama\models
        target: /models
    container_name: ollama
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - ${OPEN_WEBUI_PORT-11434}:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities:
                - gpu

  # CPU-only Ollama instance (no GPU reservation).
  ollama-cpu:
    volumes:
      - type: bind
        source: C:\MyPrograms\ollama\data
        target: /root/.ollama
      - type: bind
        source: C:\MyPrograms\ollama\models
        target: /models
    container_name: ollama-cpu
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - ${OPEN_WEBUI_PORT2-11435}:11434

  # Open WebUI front end, pointed at the GPU-backed instance by default.
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    volumes:
      - type: bind
        source: C:\MyPrograms\ollama\web
        target: /app/backend/data
    depends_on:
      - ollama
    ports:
      - ${OPEN_WEBUI_PORT-3000}:8080
    environment:
      - 'OLLAMA_BASE_URL=http://ollama:${OPEN_WEBUI_PORT-11434}'
      - 'WEBUI_SECRET_KEY='
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped
```
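As a quick usage note for the compose file above (port numbers assume the defaults shown): start the stack, then verify both Ollama instances respond before adding the second one in Open WebUI's connection settings.

```
docker compose up -d

curl http://localhost:11434/api/tags   # GPU-backed instance
curl http://localhost:11435/api/tags   # CPU-only instance
```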

@oldgithubman commented on GitHub (Apr 6, 2024):

> Run them in Docker, pinning the containers separately to gpu1, gpu2, or CPU only; open-webui can work with multiple ollama instances: […]

No offense, but that's even clunkier. You don't need to use Docker in the first place.


@oldgithubman commented on GitHub (Apr 23, 2024):

Can we have control over which model is run on which GPU?


@dhiltgen commented on GitHub (Apr 23, 2024):

> Can we have control over which model is run on which GPU?

This is something we can look at adding incrementally as this feature matures. Feel free to file a new issue and capture how you'd like it to work.
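As a side note for later readers: newer Ollama releases added scheduler-level concurrency settings, so if your version supports them, something like the following keeps several models resident at once (the variable names assume the documented OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL settings; check the release notes for your build):

```
# Allow up to 3 models loaded at once and 4 parallel requests per model,
# if the running Ollama version supports these settings.
OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_NUM_PARALLEL=4 ollama serve
```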


@dougy83 commented on GitHub (Jun 9, 2024):

> Can we have control over which model is run on which GPU?

You can create a new CPU-only model name using the following (e.g. for the phi3 model): `ollama show phi3 --modelfile > phi3-cpuonly.modelfile`, editing that file to include `PARAMETER num_gpu 0` and updating the FROM section as it describes, and then running `ollama create -f phi3-cpuonly.modelfile phi3-cpuonly`.
You then just reference phi3-cpuonly, and it loads into system RAM. You can call the file and model whatever you want.
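Spelled out end to end, the procedure above looks roughly like this (phi3 is just an example model; the flag and parameter names assume a recent Ollama CLI, so double-check against `ollama show --help`):

```
# Export the existing Modelfile, force CPU-only inference by offloading
# zero layers to the GPU, and register the result under a new name.
ollama show phi3 --modelfile > phi3-cpuonly.modelfile

# Append the CPU-only parameter (and adjust the FROM line as the exported
# file's comments describe, if needed).
echo "PARAMETER num_gpu 0" >> phi3-cpuonly.modelfile

ollama create phi3-cpuonly -f phi3-cpuonly.modelfile
ollama run phi3-cpuonly   # loads into system RAM instead of VRAM
```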


Reference: github-starred/ollama#26968