[GH-ISSUE #1600] Is there any option to unload a model from memory? #62922

Closed
opened 2026-05-03 10:49:38 -05:00 by GiteaMirror · 25 comments

Originally created by @DanielMazurkiewicz on GitHub (Dec 19, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1600

As in the title: I want to unload a model. Is there any option for it?


@zach030 commented on GitHub (Dec 19, 2023):

No support yet; you can only shut down the server to close the model.


@technovangelist commented on GitHub (Dec 19, 2023):

The model gets automatically unloaded after 5 minutes. It sounds like you want it unloaded in less time than that? Or are you saying it's taking longer than 5 minutes?


@igorschlum commented on GitHub (Dec 19, 2023):

I think it could be interesting to have a parameter for the 5 minutes. Sometimes you want less or more, depending on how you use Ollama.


@mattbisme commented on GitHub (Dec 20, 2023):

There are some use cases where it would be nice to have it unload almost immediately (single-turn Siri request). But other times where it would be better to have it loaded indefinitely (a chatbot that is prompted occasionally where a fast response provides a better experience).

Being able to configure this parameter at model runtime (especially via API) would be great!


@chymian commented on GitHub (Dec 27, 2023):

+1 for the API- and Modelfile-tunable unload parameter, especially a keep-infinite option.


@dennisorlando commented on GitHub (Jan 6, 2024):

Something as simple as `ollama load` and `ollama unload` would suffice.


@olekse commented on GitHub (Jan 6, 2024):

^^^
Edit: btw, for me it always stays in RAM (Windows 10/WSL with https://github.com/ollama-webui/ollama-webui)


@pdevine commented on GitHub (Jan 28, 2024):

#2146 adds this, which will be available in `0.1.23`. Going to go ahead and close this. You can set `keep_alive` to `-1` when calling the chat API and it will leave the model loaded in memory.
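
For reference, a minimal sketch of such a request (the model name is just a placeholder):

```shell
# Keep the model resident indefinitely after this request (keep_alive: -1)
curl http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": [{"role": "user", "content": "hello"}],
  "keep_alive": -1
}'
```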


@nathanleclaire commented on GitHub (Feb 7, 2024):

@pdevine For what it's worth, I would still like the ability to manually evict a model from VRAM through an API and CLI command. The keepalive functionality is nice, but on my Linux box (will have to double-check later to make sure it's the latest version, but installed very recently) after a chat session the model just sits there in VRAM and I have to restart ollama to get it out if something else wants to use the GPU. Which, of course, is a shame because I would also like to `ollama pull` things at the same time :)


@pdevine commented on GitHub (Feb 7, 2024):

@nathanleclaire I've been thinking about adding an `OLLAMA_KEEP_ALIVE` env variable to be able to change the default timeout. I don't want to go too extreme here though, because ideally there would be much richer controls (e.g. access controls/policies/multiple models/etc.)
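
For illustration only, assuming such a variable were added, it would presumably be read by the server at startup, something like:

```shell
# Hypothetical at the time of this comment: raise the default unload timeout to 24 hours
OLLAMA_KEEP_ALIVE=24h ollama serve
```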


@OliChase404 commented on GitHub (Feb 21, 2024):

This might help, from the faq.md file:

How do I keep a model loaded in memory or make it unload immediately?

By default, models are kept in memory for 5 minutes before being unloaded. This allows for quicker response times if you are making numerous requests to the LLM. You may, however, want to free up the memory before the 5 minutes have elapsed or keep the model loaded indefinitely. Use the `keep_alive` parameter with either the `/api/generate` or `/api/chat` API endpoint to control how long the model is left in memory.

The `keep_alive` parameter can be set to:

  • a duration string (such as "10m" or "24h")
  • a number in seconds (such as 3600)
  • any negative number, which will keep the model loaded in memory (e.g. -1 or "-1m")
  • '0', which will unload the model immediately after generating a response

For example, to preload a model and leave it in memory, use:

```shell
curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": -1}'
```

To unload the model and free up memory, use:

```shell
curl http://localhost:11434/api/generate -d '{"model": "llama2", "keep_alive": 0}'
```

@OliChase404 commented on GitHub (Feb 21, 2024):

So to immediately unload a model and free your VRAM, run:

```shell
curl http://localhost:11434/api/generate -d '{"model": "MODELNAME", "keep_alive": 0}'
```

Replace MODELNAME with the name of the model currently loaded.


@olekse commented on GitHub (Mar 18, 2024):

What about removing the model from the RAM? It still stays in RAM when VRAM is cleared.


@pdevine commented on GitHub (Mar 18, 2024):

@olekse it's the same.


@wenhui01 commented on GitHub (Apr 15, 2024):

This parameter doesn't work with embedding models.


@Divelix commented on GitHub (May 5, 2024):

A CLI command to immediately unload a model from memory is a must-have for me. Most of the time 5 minutes is fine, but occasionally I have to shut it down right away to use the full VRAM for something else. Please implement what @dennisorlando suggested above.


@VfBfoerst commented on GitHub (May 8, 2024):

> curl http://localhost:11434/api/generate -d '{"model": "MODELNAME", "keep_alive": 0}'

worked for me :)


@davidearlyoung commented on GitHub (Jun 2, 2024):

Since the terminal command `ollama ps` is a thing now, as well as loading multiple models, I feel this should be revisited: reconsider adding the ability to target an already loaded model for unloading from the command line, rather than sending a client request with the `keep_alive` parameter set to 0 whenever I want to evict a particular model from VRAM/RAM. The `keep_alive` method of targeting a single model for unloading is a bit cumbersome, especially from a server admin's perspective.


@eav-solution commented on GitHub (Jun 19, 2024):

Vote


@Dement242 commented on GitHub (Aug 16, 2024):

Here is a script that will list all running models on the given IPs (edit the script to use your own IP addresses) and unload the model you select.

```bash
#!/bin/bash

# Define the IP addresses
ips=("192.168.1.40" "192.168.1.42")

# Function to fetch models from a given IP address
fetch_models() {
    local ip=$1
    json_response=$(curl -s http://$ip:11434/api/ps)
    echo "$json_response" | jq -r --arg ip "$ip" '.models[] | "\($ip) \(.model)"'
}

# Step 1: Fetch models from all IP addresses and display as a numbered list
declare -a models_list
index=1
echo "Available models:"
for ip in "${ips[@]}"; do
    echo "Server $ip:"
    models=$(fetch_models $ip)
    IFS=$'\n' read -rd '' -a models_array <<< "$models"
    for model in "${models_array[@]}"; do
        models_list+=("$model")
        echo "  $index) ${model#* }"
        ((index++))
    done
done

# Step 2: Let the user select a model by number
echo -e "\nEnter the model number (or 0 to exit):"
read model_num
[[ $model_num -eq 0 ]] && exit 0

# Step 3: Get the selected IP and model
selected_entry="${models_list[$((model_num-1))]}"
selected_ip="${selected_entry%% *}"
selected_model="${selected_entry#* }"

# Step 4: Run the final curl command with the selected IP and model
curl http://$selected_ip:11434/api/generate -d "{\"model\": \"$selected_model\", \"keep_alive\": 0}"
echo
```

@MoreColors123 commented on GitHub (Aug 25, 2024):

I was battling with full VRAM while using Ollama and FLUX simultaneously. Here is a version for anyone looking for a solution for Windows. It was written with ChatGPT, so I don't know how, but it works. Just adjust the IP that Ollama is running on in the first line. Below I also included a .bat script to execute this .ps1.

Put this in a file called unload.ps1:

```powershell
# Define the IP addresses
$ips = @("127.0.0.1")

# Function to fetch models from a given IP address
function Fetch-Models {
    param (
        [string]$ip
    )
    $json_response = Invoke-RestMethod -Uri "http://${ip}:11434/api/ps"
    $models = $json_response.models | ForEach-Object {
        "$ip $($_.model)"
    }
    return $models
}

# Step 1: Fetch models from all IP addresses and display as a numbered list
$models_list = @()
$index = 1
Write-Host "Available models:"
foreach ($ip in $ips) {
    Write-Host "Server ${ip}:"
    $models = Fetch-Models -ip $ip
    foreach ($model in $models) {
        $models_list += $model
        Write-Host "  $index) $(($model -split ' ')[1])"
        $index++
    }
}

# Step 2: Let the user select a model by number
$model_num = Read-Host "`nEnter the model number (or 0 to exit)"
if ($model_num -eq 0) {
    exit
}

# Step 3: Get the selected IP and model
$selected_entry = $models_list[$model_num - 1].Trim()
$selected_ip = ($selected_entry -split ' ')[0].Trim()
$selected_model = ($selected_entry -split ' ')[1].Trim()

# Debug output
Write-Host "Selected IP: $selected_ip"
Write-Host "Selected Model: $selected_model"

# Step 4: Run the final Invoke-RestMethod command with the selected IP and model
$body = @{ model = $selected_model; keep_alive = 0 } | ConvertTo-Json
Invoke-RestMethod -Uri "http://${selected_ip}:11434/api/generate" -Method Post -Body $body -ContentType "application/json"
Write-Host "Model Unloaded"
```

Put this in a .bat file:

```bat
@echo off
PowerShell -NoProfile -ExecutionPolicy Bypass -File "unload.ps1"
pause
```

Then double-click the .bat file to choose the Ollama model to unload.


@pdevine commented on GitHub (Nov 17, 2024):

To unload a model, just use `ollama stop <model>`. To keep it in memory you can use `ollama run --keepalive -1s <model>`.
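
A quick sketch of that workflow (llama3 is just a placeholder model name):

```shell
ollama ps                            # list models currently loaded in memory
ollama stop llama3                   # unload the model immediately
ollama run --keepalive -1s llama3    # load the model and keep it resident indefinitely
```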


@somera commented on GitHub (Feb 27, 2025):

Instead of

```bash
for ip in "${ips[@]}"; do
    echo "Server $ip:"
    models=$(fetch_models $ip)
    IFS=$'\n' read -rd '' -a models_array <<< "$models"
    for model in "${models_array[@]}"; do
        models_list+=("$model")
        echo "  $index) ${model#* }"
        ((index++))
    done
done
```

it works for me with:

```bash
for ip in "${ips[@]}"; do
    echo "Server $ip:"
    models=$(fetch_models "$ip")

    # Read the models into an array
    mapfile -t models_array <<< "$models"

    for model in "${models_array[@]}"; do
        models_list+=("$model")
        echo "  $index) ${model#* }"
        ((index++))
    done
done
```

@mbylstra commented on GitHub (Mar 12, 2025):

> To unload a model, just use `ollama stop <model>`. To keep it in memory you can use `ollama run --keepalive -1s <model>`

I don't think that's possible from the REST API (or the Python client). It'd be great if there was a straightforward endpoint to immediately unload a model. As I understand it, it is possible with `/api/generate -d '{"model": "MODELNAME", "keep_alive": 0}'`, but as mentioned previously it is a cumbersome and counterintuitive way to do it. I don't think it'd hurt to add a specific endpoint for this.

A common use case is sharing a GPU between image generation and text generation (and needing to unload from memory when switching from text generation to image generation if the GPU does not have enough VRAM for both).


@dstadulis commented on GitHub (Sep 12, 2025):

Offered is a solution which:

  • Creates a JSON object from each line of `ollama ps` output
  • Selects the model name
  • Passes these names as inputs to a GNU parallel command
    • Which sets ollama's `keepalive` value to 0, effectively purging the model from memory

```zsh
# Unload all models listed by ollama ps -- by setting keepalive value to 0
parallel "ollama run --keepalive 0s {}" ::: $(ollama ps | awk '
  NR == 1 {
    split($0, headers)
  }
  NR > 1 {
    obj = "{"
    for (i=1; i<=NF; i++) {
      gsub(/[[:space:]]+$/, "", $i) # Trim trailing spaces
      if (i < NF) {
        obj = sprintf("%s\"%s\": \"%s\", ", obj, headers[i], $i)
      } else {
        obj = sprintf("%s\"%s\": \"%s\"}", obj, headers[i], $i)
      }
    }
    print obj
  }
' | jq -r '.NAME')
```