[GH-ISSUE #2730] Langchain + Chainlit integration issue #1640

Closed
opened 2026-04-12 11:35:04 -05:00 by GiteaMirror · 2 comments

Originally created by @Michelklingler on GitHub (Feb 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2730

Hi!

I'm using Ollama on a local server with an RTX A6000 ADA, running Mixtral 8x7B.
I run Ollama locally and expose an API endpoint for multiple users to connect and use the LLM in a chat powered by Chainlit + Langchain.

There are 2 issues I want to solve:

1 - Ollama is serving, but the model seems to unload from the GPU after a period of inactivity. This makes the first request take a little longer than usual. Is it possible to make sure the model is always running and loaded in the GPU?

2 - When I make 2 requests at the same time, the output streaming for both users freezes for a few seconds before continuing more slowly.
It makes sense that it goes slower since Ollama is serving 2 users at the same time, but I wonder why there is this constant freeze. When I make 10 requests at the same time, some of them don't even go through.

Could anyone help me on this?
Thanks!
Michel


@ryanspain commented on GitHub (Mar 9, 2024):

See the docs on [how do I keep a model loaded in memory](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately).


@jmorganca commented on GitHub (Mar 12, 2024):

Hi there, yes! Ollama can keep models in memory indefinitely by setting the `keep_alive` API parameter to `-1`. For your second issue, Ollama serves one request at a time right now, although this is something that should improve soon. I'll close this for now but feel free to re-open if you're still hitting issues. Thanks!

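For reference, the `keep_alive` behaviour described above can be exercised directly against Ollama's REST API: a `/api/generate` call with no prompt just loads the model. A minimal sketch in Python, assuming Ollama is listening on the default `http://localhost:11434` and the model was pulled locally as `mixtral`:

```python
import requests

# Load the model and keep it resident in GPU memory indefinitely.
# keep_alive=-1 means "never unload"; a duration string such as "10m"
# would instead unload the model after 10 minutes of inactivity.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mixtral", "keep_alive": -1},
)
resp.raise_for_status()
print(resp.json())  # no prompt was sent, so this only reports the load
```

The same `keep_alive` field is accepted on regular `/api/generate` and `/api/chat` requests, so a Chainlit + Langchain app can also pass it alongside every prompt rather than issuing a separate warm-up call.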