[GH-ISSUE #4541] Ollama reload same model when called in different python scripts #2849

Closed
opened 2026-04-12 13:11:21 -05:00 by GiteaMirror · 5 comments
Originally created by @x66ccff on GitHub (May 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4541

What is the issue?

I am running the qwen:32b model on dual RTX A6000 GPUs (48GB each). There seems to be sufficient VRAM available, with cuda0 using 21GB and cuda1 using less than 5GB. According to the logs, all layers of the model have been loaded onto the GPUs.

When I call the ollama library in a single Python script, it works as expected. However, when I try to call ollama from two different Python scripts simultaneously, both requiring the same qwen:32b model, ollama appears to be reloading the same model repeatedly for each API call from the different scripts. I'm puzzled as to why this behavior is occurring.
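For context, the direct-client side of such a script might look like the sketch below (a minimal illustration based on the report, not code from the issue; the call is wrapped in a function so nothing runs without a local Ollama server, and `ask_qwen` is a hypothetical name):

```python
def ask_qwen(prompt: str) -> str:
    """Send one chat request for qwen:32b to a locally running Ollama server."""
    import ollama  # imported lazily; requires `pip install ollama` and a running server

    response = ollama.chat(
        model="qwen:32b",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]
```

Running two separate scripts that each call a function like this would normally be expected to share the single loaded `qwen:32b` instance rather than trigger a reload per request.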

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.1.36

GiteaMirror added the bug label 2026-04-12 13:11:21 -05:00

@pdevine commented on GitHub (May 20, 2024):

@x66ccff can you try updating to ollama 0.1.38?


@x66ccff commented on GitHub (May 21, 2024):

> @x66ccff can you try updating to ollama 0.1.38?

The issue persists in 0.1.38.


@x66ccff commented on GitHub (May 21, 2024):

The issue seems to be caused by two Python scripts importing the ollama library differently. One uses llama_index.llms.ollama, while the other imports ollama.chat directly. This difference in import methods might be leading to the qwen:32b model being reloaded repeatedly when called from multiple scripts, despite requiring the same model. Any suggestions to solve this?

```python
import ollama

options = ollama.Options(
    num_ctx=8192,
    num_predict=2,
)

response = ollama.chat(
    model=model_name,
    messages=[
        {
            'role': 'user',
            'content': batch_text + prompt,
        },
    ],
    options=options,
)
```

```python
from llama_index.llms.ollama import Ollama

llm = Ollama(model=model_name, request_timeout=600)

# ...

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model_zh if select_language == 'zh' else embed_model_en,
    text_splitter=text_splitter,
    system_prompt=system_prompt,
)

# ...
index_ = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

# ...
self.chat_engine = self.index.as_chat_engine(
    chat_mode="context",
    streaming=True,
)
# ...
stream = self.chat_engine.stream_chat(f"{users_ask}")
```

@x66ccff commented on GitHub (May 21, 2024):

Either method on its own can be called repeatedly from multiple scripts. But when the two different methods are used at the same time, the model is reloaded for each method.


@x66ccff commented on GitHub (May 21, 2024):

Changing the context length (`num_ctx`) in `ollama.Options` from 8192 to 3900 (the same as the llama_index default) solved this problem.

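A minimal sketch of that fix, under the assumption suggested by this thread: Ollama treats requests whose options (notably `num_ctx`) differ as requiring a fresh model load, so pinning one shared value across both clients lets the already-loaded `qwen:32b` instance be reused. The names `LLAMA_INDEX_DEFAULT_NUM_CTX` and `shared_options` are illustrative, and forwarding the value through llama_index's `additional_kwargs` is an assumption about its interface:

```python
# llama_index's default context window, per this thread
LLAMA_INDEX_DEFAULT_NUM_CTX = 3900

def shared_options(num_predict: int = 2) -> dict:
    """Options dict usable by the direct ollama client; keeps num_ctx aligned
    with the llama_index side so both request the same model configuration."""
    return {"num_ctx": LLAMA_INDEX_DEFAULT_NUM_CTX, "num_predict": num_predict}

# Direct client:
#   ollama.chat(model=model_name, messages=..., options=shared_options())
# llama_index wrapper (additional_kwargs assumed to forward Ollama options):
#   Ollama(model=model_name,
#          additional_kwargs={"num_ctx": LLAMA_INDEX_DEFAULT_NUM_CTX})
```

With both clients sending the same `num_ctx`, each API call should hit the resident model instead of forcing a reload.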
Reference: github-starred/ollama#2849