[GH-ISSUE #5557] Model getting unloaded from memory every time prompt is sent #65510

Closed
opened 2026-05-03 21:31:19 -05:00 by GiteaMirror · 10 comments
Owner

Originally created by @vjsyong on GitHub (Jul 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5557

Originally assigned to: @jmorganca on GitHub.

### What is the issue?

With this latest version of Ollama, every time a prompt is sent to the model, the model gets unloaded from memory and needs to be reinitialised, leading to much longer time-to-first-token responses. It's the same for every model that I've tested.

Here's what it looks like in the current version 0.2.0 Docker container:

![unload](https://github.com/ollama/ollama/assets/46567682/86cb0093-5a20-4033-9595-d986674ee05c)

This is what it used to be in the previous version 0.1.47 Docker container:

![unload_old](https://github.com/ollama/ollama/assets/46567682/02ab926a-7f46-4704-b873-c7ab852c311d)

Here's the docker compose config that I'm using:

**EDIT: I had linked an older docker compose YAML file that I used for the 0.1.47 version. I've updated it to the one actually used in the problematic version. To note: `OLLAMA_NUM_PARALLEL` is set to 4 and `OLLAMA_MAX_LOADED_MODELS` is set to 2, instead of 1 and 1 as originally posted.**

```
services:
  ollama:
    image: ollama/ollama:0.2.0
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    tty: true
    restart: unless-stopped
    environment:
      - OLLAMA_FLASH_ATTENTION=0
      - OLLAMA_KEEP_ALIVE=10m
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
    networks:
      - llms

  open-webui:
    build:
      context: .
      args:
        OLLAMA_BASE_URL: '/ollama'
      dockerfile: Dockerfile
    image: ghcr.io/open-webui/open-webui:latest
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama
    ports:
      - ${OPEN_WEBUI_PORT-3000}:8080
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped
    networks:
      - llms

volumes:
  ollama: {}
  open-webui: {}

networks:
  llms:
    external: true
```

### OS

Docker

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.2.0

GiteaMirror added the bug label 2026-05-03 21:31:19 -05:00

@jmorganca commented on GitHub (Jul 9, 2024):

Hi @vjsyong sorry this is happening. A few questions:

  1. Is this with Docker Desktop?
  2. Do you have the logs handy from the container? That might give the reason for the reload. Would it be possible to use `-e OLLAMA_DEBUG=1` as well so we have more info on why? (Note: this will include prompt info, so start a clean container.)

Thanks so much - we'll get this fixed
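For the compose setup from the original post, enabling the debug logging requested above could look like the fragment below (a sketch only; the `OLLAMA_DEBUG` flag is the one named in this comment, the rest mirrors the poster's config):

```yaml
services:
  ollama:
    image: ollama/ollama:0.2.0
    environment:
      - OLLAMA_DEBUG=1          # verbose logs, including the reason for a reload
      - OLLAMA_KEEP_ALIVE=10m
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
```

The equivalent for a plain `docker run` is passing `-e OLLAMA_DEBUG=1`, as suggested.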


@eric0095 commented on GitHub (Jul 9, 2024):

Same issue found here. After updating to v0.2, each time I send a request the model reloads anyway. Waiting for a fix.

Windows desktop Ollama with three RTX 3080 GPUs (worked well under the previous version). Server log attached: [server.log](https://github.com/user-attachments/files/16136858/server.log)

![New bitmap image](https://github.com/ollama/ollama/assets/13792899/77405c87-6d27-476b-a9b5-5a1f963336e8)


@vjsyong commented on GitHub (Jul 9, 2024):

> Hi @vjsyong sorry this is happening. A few questions:
>
> 1. Is this with Docker Desktop?
> 2. Do you have the logs handy from the container? That might give the reason for the reload. Would it be possible to use `-e OLLAMA_DEBUG=1` as well so we have more info on why? (note, this will have prompt info so start a clean container)
>
> Thanks so much - we'll get this fixed

Thanks for the prompt response!

This is running Docker on a remote Ubuntu server. I will upload the logs later on.


@rasodu commented on GitHub (Jul 9, 2024):

I am using ROCm on Docker and am having the same issue.


@vjsyong commented on GitHub (Jul 9, 2024):

Here are the logs:

https://pastebin.com/Pk3CK7LN

Also I realised that I mixed up the docker compose I linked in this issue. Looks like someone else also encountered this issue in #5542

Specifically, this only occurs with `OLLAMA_NUM_PARALLEL` > 1.

I incorrectly specified that it was equal to one in my original post. I was using an older docker compose YAML file. I will edit it to the one I actually used when this issue came up


@eric0095 commented on GitHub (Jul 9, 2024):

> Here's the logs
>
> https://pastebin.com/Pk3CK7LN
>
> Also I realised that I mixed up the docker compose I linked in this issue. Looks like someone else also encountered this issue in #5542
>
> Specifically. This only occurs with `OLLAMA_NUM_PARALLEL` > 1
>
> I incorrectly specified that it was equal to one in my original post. I was using an older docker compose YAML file. I will edit it to the one I actually used when this issue came up

I have retested. It's true: this issue does not occur when `OLLAMA_NUM_PARALLEL` is 1.
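Based on the retests in this thread (the reload only happens with `OLLAMA_NUM_PARALLEL` > 1), a temporary workaround until a fixed release is to pin parallelism back to 1 in the compose environment. A sketch against the original config:

```yaml
services:
  ollama:
    image: ollama/ollama:0.2.0
    environment:
      - OLLAMA_KEEP_ALIVE=10m
      - OLLAMA_NUM_PARALLEL=1       # avoids the per-prompt unload reported here
      - OLLAMA_MAX_LOADED_MODELS=2
```

This trades away concurrent request handling, so it is only a stopgap.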


@jmorganca commented on GitHub (Jul 9, 2024):

Working on a fix - sorry folks!


@jmorganca commented on GitHub (Jul 9, 2024):

Fixed in https://github.com/ollama/ollama/releases/tag/v0.2.1


@mxmp210 commented on GitHub (Jul 9, 2024):

This issue isn't fully resolved yet. Something is wrong with the scheduler checks, which force-close the runner after "context for request finished"; when the next request comes in, it goes to `runner.needsReload(ctx, pending)`, which should return false if the runner is already loaded and in memory with `NUM_PARALLEL` > 1.

It has something to do with the [runner options check](https://github.com/ollama/ollama/blob/e4ff73297db2f53f1ea4b603df5670c5bde6a944/server/sched.go#L626) and [normalizing NumCtx for parallelism](https://github.com/ollama/ollama/blob/e4ff73297db2f53f1ea4b603df5670c5bde6a944/server/sched.go#L622), which are changed by parallel requests/runners somewhere outside the context. My guess is that the runner first loads with a greater numParallel value, and then on the second pass at [L132](https://github.com/ollama/ollama/blob/e4ff73297db2f53f1ea4b603df5670c5bde6a944/server/sched.go#L132) it gets resettled to prevent crashing.

Hope this helps debug these issues further. Referencing #4165 as it is related. The docs should mark these env flags as 'experimental', to suggest the settings might not work until they are stable enough.
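The hypothesis above can be illustrated with a minimal Go sketch. This is not the actual ollama scheduler code: `Options` and `needsReload` here are simplified stand-ins, showing only how a context size scaled by the parallel-slot count at load time would never match the unscaled value on the next request, forcing a reload on every prompt.

```go
package main

import "fmt"

// Options is a simplified stand-in for the runner options that the
// scheduler compares between the loaded runner and a pending request.
type Options struct {
	NumCtx int
}

// needsReload mimics a naive options comparison: any mismatch forces
// the model to be unloaded and loaded again.
func needsReload(loaded, requested Options) bool {
	return loaded != requested
}

func main() {
	numParallel := 4
	requested := Options{NumCtx: 2048}

	// First load: the context size is scaled up to carve out one slot
	// per parallel request.
	loaded := Options{NumCtx: requested.NumCtx * numParallel}

	// The next request arrives with the original, unscaled value, so
	// the comparison spuriously reports a reload is needed.
	fmt.Println(needsReload(loaded, requested)) // true

	// With OLLAMA_NUM_PARALLEL=1 the values match and no reload occurs.
	loadedSerial := Options{NumCtx: requested.NumCtx * 1}
	fmt.Println(needsReload(loadedSerial, requested)) // false
}
```

This matches the thread's observation that the problem disappears when `OLLAMA_NUM_PARALLEL` is 1: with one slot the scaled and unscaled values coincide.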


@cafesao commented on GitHub (Jul 11, 2024):

I can confirm this problem also happens to me.


Reference: github-starred/ollama#65510