[GH-ISSUE #10031] Unload model to RAM instead of disk #68635

Closed
opened 2026-05-04 14:40:34 -05:00 by GiteaMirror · 4 comments

Originally created by @AlbertoSinigaglia on GitHub (Mar 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10031

I have a setup with a lot of RAM and many models, which are loaded from and unloaded to the main disk. Ideally, I would like to unload models to RAM and reload them into VRAM when needed, instead of fetching them from the main disk every time, which is much slower.

Is there a way to do this? Something like OLLAMA_MAX_LOADED_MODELS, but for caching the model weights in RAM.

GiteaMirror added the feature request label 2026-05-04 14:40:34 -05:00

@rick-github commented on GitHub (Mar 29, 2025):

The operating system already caches the model. If you are not doing a lot of disk reads outside of model loading, the model will be available in the page cache.
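
A minimal sketch of relying on the page cache on Linux: reading a blob once pulls its pages into the cache, so a later load can be served from RAM as long as nothing has evicted them. The `warm_cache` helper and the example path are assumptions for illustration, not part of Ollama.

```bash
#!/bin/bash
# Hypothetical helper: read each file once so the kernel keeps its pages
# in the page cache (until memory pressure evicts them).
warm_cache() {
    local f
    for f in "$@"; do
        [ -f "$f" ] || { echo "skip: $f is not a regular file" >&2; continue; }
        cat -- "$f" > /dev/null || return 1
        echo "warmed: $f"
    done
}

# Example (path is an assumption; adjust to your OLLAMA_MODELS directory):
# warm_cache /usr/share/ollama/.ollama/models/blobs/sha256-*
```

If it is installed, `fincore` from util-linux can report how much of a file is currently resident in the cache.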


@AlbertoSinigaglia commented on GitHub (Mar 29, 2025):

That makes sense, but it doesn't quite match our setup. We have few users and a lot of RAM (around 512 GB), so keeping a 40 GB model resident is not a major problem, and if it shortens model swaps, that would be great. I already do this in ComfyUI with Stable Diffusion models, and RAM-to-VRAM transfers feel much faster than NVMe-to-VRAM (though in testing the NVMe speed is around 6Gbps).
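
As a rough back-of-the-envelope check of the numbers above, the sketch below estimates how long a 40 GB model takes to read at a given sustained rate. The "6Gbps" figure is ambiguous, so both readings (6 GB/s and 6 Gbit/s) are shown as assumptions; the `transfer_seconds` helper is hypothetical and ignores latency and any other overhead.

```bash
#!/bin/bash
# Estimate seconds to move SIZE_GB gigabytes at RATE gigabytes per second.
# Pure arithmetic: size divided by rate, printed with one decimal place.
transfer_seconds() {
    awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", size / rate }'
}

transfer_seconds 40 6      # reading "6Gbps" as 6 GB/s            -> 6.7
transfer_seconds 40 0.75   # reading it as 6 Gbit/s (0.75 GB/s)   -> 53.3
```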


@rick-github commented on GitHub (Mar 29, 2025):

If the page cache doesn't suffice, create a ramdisk that holds the model.

Create a mount point for the ramdisk:

$ sudo mkdir /mnt/ollama
$ echo "tmpfs /mnt/ollama tmpfs size=45G,mode=755,uid=ollama,gid=ollama 0 0" | sudo tee -a /etc/fstab
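
Before settling on `size=45G`, it can help to add up the files that will actually live in the ramdisk. This is a sketch under the assumption that you already know which blob files belong to the model; `total_bytes` is a hypothetical helper, not something Ollama provides.

```bash
#!/bin/bash
# Hypothetical helper: print the combined size in bytes of the given files
# (symlinks are followed), so the tmpfs size= option can be chosen with
# some headroom.
total_bytes() {
    local total=0 size f
    for f in "$@"; do
        size=$(wc -c < "$f") || return 1
        total=$(( total + size ))
    done
    echo "$total"
}

# tmpfs reads the G suffix as GiB, so size=45G allows this many bytes:
echo $(( 45 * 1024 * 1024 * 1024 ))
```

Compare the output of `total_bytes` for the model's blobs against that limit; the copy in the populate step fails if the tmpfs fills up.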

Create a script to populate the ramdisk. In this example, the model qwq is copied into the ramdisk; the other models remain in ~ollama/.ollama/models and are only symlinked.

populate.sh
#!/bin/bash

die(){
    echo "$1" >&2
    exit 1
}

_=$(command -v jq) || die "Need jq"

! getopt --test > /dev/null
if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
    die '`getopt --test` failed in this environment.'
fi
OPTIONS=ns:d:
LONGOPTS=dryrun,source:,destination:
! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@")
if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
    exit 2
fi
eval set -- "$PARSED"

DRYRUN=
src=${OLLAMA_MODELS-/usr/local/share/ollama/.ollama/models}

while true; do
  case "$1" in
    -n|--dry_run|--dryrun)
      DRYRUN=echo
      shift
      ;;
    -s|--source)
      src="$2"
      shift 2
      ;;
    -d|--destination)
      dst="$2"
      shift 2
      ;;
    --)
      shift
      break
      ;;
    *)
      die "Programming error"
      ;;
  esac
done

models=$*

[ -z "$src" -o ! -d "$src" -o ! -d "$src/blobs" -o ! -d "$src/manifests" ] && die "Invalid source '$src'"
[ -z "$dst" ] && die "No destination supplied"
dst_root=$(dirname "$dst")
[ ! -d "$dst_root" ] && die "Destination root '$dst_root' doesn't exist"

src=${src%%/}
dst=${dst%%/}

for m in $models ; do
  domain=registry.ollama.ai
  library=library
  name=${m%:*} ; name=${name##*/}
  tag=latest
  [[ $m = *:* ]] && tag=${m#*:}
  [[ $m = */*/* ]] && domain=${m%%/*}
  [[ $m = */* ]] && { library=${m%/*} ; library=${library#*/} ; }
  path="$src/manifests/$domain/$library/$name/$tag"
  [ ! -f "$path" ] && missing+=( "$m" )
  paths+=( "$path" )
done

[ ${#missing[*]} -gt 0 ] && die "Couldn't find model(s): ${missing[*]}"

[ -d "$dst" ] && { $DRYRUN chmod 777 "$dst"/{blobs,manifests} || die "Couldn't set '$dst' writeable" ; }

$DRYRUN cp --archive --symbolic-link --no-target-directory --no-clobber "$src" "$dst" || die "Copy failed"

for path in ${paths[*]} ; do
  blobs=$(jq -r '.layers[],.config|.digest' "$path" | tr : -)
  for blob in $blobs ; do
    $DRYRUN rm "$dst/blobs/$blob"
    $DRYRUN cp "$src/blobs/$blob" "$dst/blobs/$blob" || die "Blob '$src/blobs/$blob' copy failed"
  done
done

$DRYRUN chmod 555 "$dst"/{blobs,manifests} || die "Couldn't set '$dst' non-writeable"

Mount the ramdisk and run the script once by hand:

$ sudo mount /mnt/ollama
$ sudo -u ollama ./populate.sh -s /usr/share/ollama/.ollama/models -d /mnt/ollama/models qwq
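
To see which manifest the script will look for, the name parsing in populate.sh can be tried on its own. The sketch below applies the same rules (default domain `registry.ollama.ai`, default namespace `library`, default tag `latest`), written with `case` instead of `[[ ]]`; `manifest_relpath` and the `someuser/somemodel` reference are hypothetical names used only for illustration.

```bash
#!/bin/bash
# Same parsing rules as the loop in populate.sh: print the manifest path,
# relative to the models directory, for a model reference.
manifest_relpath() {
    local m=$1 domain=registry.ollama.ai library=library name tag=latest
    name=${m%:*}; name=${name##*/}                  # drop tag, then any prefix
    case $m in *:*)   tag=${m#*:} ;; esac           # explicit tag, if present
    case $m in */*/*) domain=${m%%/*} ;; esac       # domain/namespace/name form
    case $m in */*)   library=${m%/*}; library=${library#*/} ;; esac
    echo "manifests/$domain/$library/$name/$tag"
}

manifest_relpath qwq                     # manifests/registry.ollama.ai/library/qwq/latest
manifest_relpath someuser/somemodel:v1   # manifests/registry.ollama.ai/someuser/somemodel/v1
```

Running populate.sh with `-n` (or `--dryrun`) prints the `chmod`, `cp`, and `rm` commands instead of executing them, which is a cheap way to check the paths first.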

Adjust the ollama service file to reference the ramdisk.

$ sudo systemctl edit ollama

Add the following lines. ExecStartPre runs the script, and so repopulates the ramdisk, each time the service is started or restarted.

[Service]
ExecStartPre=/path/to/populate.sh -s /usr/share/ollama/.ollama/models -d /mnt/ollama/models qwq
Environment="OLLAMA_MODELS=/mnt/ollama/models"

Restart the service:

$ sudo systemctl restart ollama
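
After the restart, one way to sanity-check the result is to count which blobs are real files (copied into the ramdisk) and which are still symlinks back to the on-disk store. This is a sketch assuming the paths used above; `count_blobs` is a hypothetical helper.

```bash
#!/bin/bash
# Hypothetical check: count entries of one type directly inside a directory.
# kind is a find(1) type letter: f = regular file, l = symlink.
count_blobs() {
    local dir=$1 kind=$2
    find "$dir" -maxdepth 1 -type "$kind" | wc -l | tr -d ' '
}

# Example, assuming the ramdisk layout from the steps above:
# echo "in RAM:  $(count_blobs /mnt/ollama/models/blobs f)"
# echo "on disk: $(count_blobs /mnt/ollama/models/blobs l)"
```

`findmnt /mnt/ollama` should also list a tmpfs entry if the mount is in place.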

@AlbertoSinigaglia commented on GitHub (Mar 31, 2025):

@rick-github that looks good, I'll give it a try, thanks a lot!

Reference: github-starred/ollama#68635