[GH-ISSUE #3369] Pull a model on start or without requiring serve #48584

Open
opened 2026-04-28 08:54:07 -05:00 by GiteaMirror · 14 comments

Originally created by @0x77dev on GitHub (Mar 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3369

What are you trying to do?

To enhance the user experience when deploying Ollama with models in a containerized environment, it would be beneficial to enable pre-embedding a model into the image through a custom Dockerfile, or pulling a model upon starting Ollama by specifying an argument or environment variable. This would eliminate the need for an API request after the container starts.

How should we solve this?

  • ollama serve --pull [models]
  • OLLAMA_PULL=model1,model2 ollama serve (sketched below)
  • ollama pull without ollama serve (a somewhat harder option to implement, but one that improves the ability to create custom images with custom models beyond just pulling)
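
For example, if the OLLAMA_PULL form were adopted, a Compose deployment could declare its models up front. This is purely illustrative; neither the variable nor the flag exists in any released Ollama:

services:
  ollama:
    image: ollama/ollama
    environment:
      # Hypothetical variable from this proposal; not implemented today
      OLLAMA_PULL: "mistral,llama2"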

What is the impact of not solving this?

This would be a significant improvement for hosting Ollama. Without it, deploying Ollama, especially in a production environment, is more challenging than it needs to be.

Anything else?

Related:

  • https://github.com/ollama/ollama/issues/1322
  • https://github.com/ollama/ollama/issues/358#issuecomment-2022394098

GiteaMirror added the docker and feature request labels 2026-04-28 08:54:08 -05:00

@ip2cloud commented on GitHub (Apr 2, 2024):

This would be very useful in a docker-compose.yaml, including automatically pulling more than one model after the container comes up.

I saw that the Kubernetes Helm chart has these entries in its values.yaml:

ollama:
  gpu:
    # -- Enable GPU integration
    enabled: true

    # -- GPU type: 'nvidia' or 'amd'
    type: 'nvidia'

    # -- Number of GPUs to use
    number: 2

  # -- List of models to pull at container startup
  models:
    - mistral
    - llama2

How can this be done in docker-compose.yaml?
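
A workable pattern today, pending a built-in option, is to override the container command so the server pulls models once it is reachable. A minimal sketch; the model names and the ollama list readiness check mirror other examples in this thread:

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        # Start the server in the background, wait until it answers, then pull
        ollama serve &
        until ollama list >/dev/null 2>&1; do sleep 1; done
        ollama pull mistral
        ollama pull llama2
        wait
volumes:
  ollama: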

@shivaraj-bh commented on GitHub (Apr 17, 2024):

Decoupling pull from serve would also be very helpful for setting up requirements before starting the server.

I would love to see this implemented.

@Chernegi commented on GitHub (Apr 23, 2024):

You may add to your Dockerfile:

RUN ollama serve & sleep 5 ; ollama pull $model_name ; \
    echo "kill 'ollama serve' process" ; \
    ps -ef | grep 'ollama serve' | grep -v grep | awk '{print $2}' | xargs -r kill -9
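
A sketch of a less timing-sensitive variant of the same trick, polling for readiness instead of sleeping a fixed 5 seconds (the ollama list readiness check is the same one used in the Kubernetes workaround later in this thread):

RUN ollama serve & SERVE_PID=$! ; \
    until ollama list >/dev/null 2>&1; do sleep 1; done ; \
    ollama pull $model_name ; \
    kill $SERVE_PID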

@xyproto commented on GitHub (Sep 4, 2024):

Not being able to use ollama pull without ollama serve is problematic when trying to package models as Arch Linux packages.

Being able to package models as packages is useful, because then other packages or applications can depend on both Ollama and a specific model being available.

An example of why the current situation is a bit awkward:

pkgname=ollama-tinyllama
_tag=latest
pkgver=1.0.0
pkgrel=1
pkgdesc='The tinyllama (1B) large language model (LLM), for Ollama'
arch=(any)
url='https://github.com/jzhang38/TinyLlama'
license=(Apache-2.0)
depends=(ollama)
makedepends=(python)

prepare() {
  # Find a free port
  export OLLAMA_HOST=":$(python -c 'import socket; s=socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')"

  # Create a place to keep the models
  mkdir -p models
  export OLLAMA_MODELS="$srcdir/models"

  # Start Ollama
  ollama serve &
  serve_pid=$!

  # Try downloading the model with ollama, wait 1 second if ollama is not ready yet, try 10 times
  for i in {1..10}; do
    ollama pull "${pkgname#ollama-}:$_tag" && break || sleep 1
  done

  # Stop Ollama
  kill $serve_pid
}

package() {
  install -d "$pkgdir/var/lib/ollama"
  cp -r models/. "$pkgdir/var/lib/ollama/"
}

Being able to use ollama pull without having to start Ollama would be useful.

If the models were placed in separate directories, it would also be easier to manage permissions, in the context of Linux packages.

@a-h commented on GitHub (Sep 27, 2024):

Raised PR https://github.com/ollama/ollama/pull/7001
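
With that change applied, pulling into a models directory without a running server looks roughly like the following sketch; the --local flag comes from the PR (it also appears in the Nix snippet in the next comment) and is not part of any released Ollama:

# Hypothetical until the PR lands: pull straight to disk, no server required
OLLAMA_MODELS=/var/lib/ollama/models ollama pull tinyllama:latest --local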

@a-h commented on GitHub (Sep 27, 2024):

With the changes, I'm able to define the models I want to have present in a Nix Flake. In my case, I'm using version 3.11 as a base, and I've applied my PR as a patch to test it.

      # Wrap ollama so that we can set environment variables to provide models.
      wrappedOllama = system: pkgs:
        let
          #TODO: When https://github.com/ollama/ollama/pull/7001 is merged and the unstable 
          # nixpkgs uses the version with it, we can remove the src and vendorHash overrides,
          # keeping the acceleration override.
          ollama = (pkgs.ollama.overrideAttrs {
            version = "3.11-patch";
            src = pkgs.fetchFromGitHub {
              owner = "a-h";
              repo = "ollama";
              rev = "42e790d02524f5f461eb241d88de12cf6d9afdb2";
              fetchSubmodules = true;
              hash = "sha256-R7KT1Vg4VRtoI1lXBiIKbQJQfxn6sAYXBwAisl1MN5c=";
            };
            vendorHash = "sha256-hSxcREAujhvzHVNwnRTfhi0MKI3s8HNavER2VLz6SYk=";
          }).override
            (oldAttrs: {
              acceleration =
                if system == "aarch64-darwin" || system == "x86_64-darwin" # If darwin, use metal.
                then null
                else "cuda"; # If linux, use cuda. (change manually to "rocm" for AMD GPUs)
            });
          models = pkgs.runCommand "pull-models" { } ''
            export HOME="$out"
            ${ollama}/bin/ollama pull mistral-nemo --local
            ${ollama}/bin/ollama pull nomic-embedded-text --local
          '';
          wrapped = pkgs.writeShellScriptBin "ollama" ''
            export HOME="${models}"
            export OLLAMA_MODELS="${models}/.ollama/models"
            exec ${ollama}/bin/ollama "$@"
          '';
        in
        pkgs.symlinkJoin {
          name = "ollama";
          paths = [
            models
            wrapped
            ollama
          ];
        };

Then, I can override the ollama that's in nixpkgs with my expression that preloads models:

      forAllSystems = f: nixpkgs.lib.genAttrs allSystems (system: f {
        system = system;
        pkgs = import nixpkgs {
          inherit system;
          overlays = [
            # Use ollama from unstableNixPkgs, because it's a bit more
            # bleeding edge.
            (final: prev: {
              ollama = (wrappedOllama system (unstableNixPkgs system));
            })
          ];
        };
      });

@a-h commented on GitHub (Oct 4, 2024):

@dhiltgen - You recently closed #7046 saying that it was tracked here. Not sure if you saw that I raised a PR for this? Thanks!

@ghost commented on GitHub (Jan 22, 2025):

Is there a reason I still can't ollama pull before ollama serve in the CLI? I must have downloaded the models manually the first time, because if you don't have models in place before ollama serve, you are forced to download a 1.6 GB model. Over networks with high security this takes a really long time, and there is no progress bar even indicating that a download is ongoing; I only found out it was happening after accidentally leaving it running while searching for a solution.

I can't pull a model using ollama pull, and I don't mind downloading it manually and changing paths. The problem is that it forces a silent download of 16 parts of roughly 100 MB each and doesn't log any of it until it's done. In my case it's the same time investment either way, but arriving at the conclusion that this was the problem at all took an entire day of red herrings: sometimes the cause is an inappropriate port or a firewall issue, and sometimes the download actually finishes, so diagnosing it is pretty time-consuming. The output in the working default-port case and in the case of a port you don't have permission to listen on look identical right up to the point of the silent download. ollama -v shows the default port working, while on the custom port it reports that Ollama is not running, with the usual message asking you to run it.

@xyproto commented on GitHub (Jan 22, 2025):

This utility may, maybe, possibly, be helpful to you:

https://github.com/xyproto/ollamaurl

@ghost commented on GitHub (Jan 22, 2025):

Thanks. Do you still think it would be useful, given that I don't even run ollama pull? Really, my main issue is signaling to the user a process that can cost a large time investment per iteration. With no download bar during serve, and people still saying to run ollama serve and ollama pull separately, it's hard to know, even when I know which model is being downloaded, whether ollama serve is actually invoking a pull when no model is present. It's hard to understand why I'm still seeing this implicit dependence given the expected logic flow; everything seems to imply a different schema.

I would just argue that either the behavior should be more deliberate, or the decoupling should work in a way that lets you serve first and then pull the model afterwards. I can't pull without serving first, so my two options are to scp from a faster network or to wait. And waiting implies far more is wrong than actually is, because there is no simple tqdm-style progress tracker on an already-split download; it could just print chunks as they finish. I didn't even know this was happening at all.

@ghost commented on GitHub (Jan 22, 2025):

I could also be misdiagnosing this. I just notice that after printing the GPU specs it hangs for 15 minutes, then shows API calls being attempted, and eventually it makes one more API call and reports that 16 chunks of roughly 100 MB each were downloaded, or something similar.

time=2025-01-22T10:59:34.528-06:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-01-22T10:59:34.767-06:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-473729a7-a78c-bd5a-eea8-9888394b121a library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
[GIN] 2025/01/22 - 11:22:10 | 200 | 99.533µs | 10.1.120.100 | GET "/"
[GIN] 2025/01/22 - 11:22:10 | 404 | 7.824µs | 10.1.120.100 | GET "/favicon.ico"


@khteh commented on GitHub (Apr 4, 2025):

https://github.com/ollama/ollama/issues/10122

@otobonh commented on GitHub (Sep 10, 2025):

> You may add to your Dockerfile:
>
> RUN ollama serve & sleep 5 ; ollama pull $model_name ; echo "kill 'ollama serve' process" ; ps -ef | grep 'ollama serve' | grep -v grep | awk '{print $2}' | xargs -r kill -9

This worked for me. Thanks

@BOPOHA commented on GitHub (Feb 18, 2026):

I know this doesn't solve the core issue of needing the server running to pull models, but here is a working workaround for Kubernetes users.

spec:
  containers:
    - name: ollama
      image: ollama/ollama:0.3.12
      command: ["/bin/sh"]
      args:
        - "-c"
        - |
          set -eu
          ollama serve &
          until ollama list >/dev/null 2>&1; do sleep 1; done
          ollama pull hf.co/Qwen/Qwen3-Embedding-0.6B-GGUF:Q8_0
          ollama pull nomic-embed-text
          wait


> You may add to your Dockerfile:
>
> RUN ollama serve & sleep 5 ; ollama pull $model_name ; echo "kill 'ollama serve' process" ; ps -ef | grep 'ollama serve' | grep -v grep | awk '{print $2}' | xargs -r kill -9

lol

RUN ollama serve & \
    PID=$! && \
    sleep 5 && \
    ollama pull $model_name && \
    kill -9 $PID

This approach is more concise: it relies on the shell's built-in $! variable to terminate the background process explicitly.

Reference: github-starred/ollama#48584