[GH-ISSUE #7114] Keeping two models loaded in memory together (to avoid reload delay) does not work with OLLAMA_KEEP_ALIVE=-1 #66577

Closed
opened 2026-05-04 07:28:46 -05:00 by GiteaMirror · 2 comments

Originally created by @zw963 on GitHub (Oct 7, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7114

What is the issue?

Let me give an example to explain.

I use llama3.2 for translation on my local laptop, like this:

 ╰─ $ ollama run llama3.2 "Translate the following content from English to Chinese without additional explanation:
 hello world"
问候世界

The first time I run it, it takes several seconds to load the model into memory; after that, running it again gives me the result immediately, without delay.

But if I open a new terminal and run another model, e.g. ollama run llama3.1 interactively, and then go back to running llama3.2, I have to wait several seconds again for llama3.2 to be reloaded into memory.

What I expect is that both models can stay loaded in memory without either being unloaded in this case: one for translation, which returns the result and exits but can run again immediately the next time I translate, while the other (llama3.1) stays open interactively for asking questions.

Any ideas? Or is this just a configuration issue?

Below is my service file; my GPU shared memory is set to 8 GB.

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
Environment="OLLAMA_KEEP_ALIVE=-1"
ExecStart=/home/zw963/utils/llms/bin/ollama serve
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
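
For reference, keep_alive can also be passed per request instead of (or in addition to) the service-wide OLLAMA_KEEP_ALIVE variable. A minimal sketch against the HTTP API, assuming the server is listening on the default localhost:11434 and llama3.2 is already pulled; the prompt text is just the translation example from above:

# one-shot generate request that asks the server to keep llama3.2 loaded indefinitely
$ curl http://localhost:11434/api/generate -d '{
    "model": "llama3.2",
    "prompt": "Translate the following content from English to Chinese without additional explanation: hello world",
    "keep_alive": -1,
    "stream": false
  }'

A keep_alive of -1 keeps the model resident indefinitely, a duration such as "10m" keeps it loaded for that long, and 0 unloads it as soon as the response is returned; ollama ps should then show the model with UNTIL "Forever".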

Thanks

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.3.12

GiteaMirror added the bug label 2026-05-04 07:28:46 -05:00

@rick-github commented on GitHub (Oct 7, 2024):

$ ollama ps
NAME            ID              SIZE    PROCESSOR       UNTIL            
llama3.1:latest 75382d0899df    5.5 GB  100% GPU        2 hours from now
llama3.2:latest a80c4f17acd5    3.1 GB  100% GPU        2 hours from now

The combined size of the models is 8.6 GB, so they cannot co-reside on an 8 GB GPU. You can make one model live in RAM by setting num_gpu to 0, see https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650. llama3.2 would be the best choice for this: it is smaller and so runs faster anyway, leaving the GPU for the larger llama3.1 model.
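
A hedged sketch of what the num_gpu suggestion could look like when passed per request through the HTTP API (num_gpu and keep_alive are documented request fields for /api/generate; the server address and prompt text are just illustrative):

# run llama3.2 entirely on the CPU for this request, while keeping it loaded indefinitely
$ curl http://localhost:11434/api/generate -d '{
    "model": "llama3.2",
    "prompt": "Translate the following content from English to Chinese without additional explanation: hello world",
    "options": { "num_gpu": 0 },
    "keep_alive": -1
  }'

With num_gpu set to 0, none of llama3.2's layers are offloaded to the GPU, so it runs from system RAM on the CPU and the 8 GB of GPU memory stays free for llama3.1; see the linked comment above for other ways to apply the same option.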


@zw963 commented on GitHub (Oct 7, 2024):

> Combined size of the models is 8.6G, they cannot co-reside on an 8G GPU.

Hi, thanks for the answer. I use a 7840HS with a 780M iGPU; I set the shared GPU memory to 16 GB, and now I can load both of them into memory.

 ╰─ $ ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
llama3.1:latest    91ab477bec9d    6.7 GB    100% GPU     Forever
llama3.2:latest    a80c4f17acd5    4.0 GB    100% GPU     Forever

EDIT: in fact, what I mean is loading the models into the GPU (my mini PC's GPU memory is shared memory).


Reference: github-starred/ollama#66577