pausing and resuming inference #8127

Closed
opened 2025-11-12 14:31:25 -06:00 by GiteaMirror · 22 comments
Owner

Originally created by @Abdulrahman392011 on GitHub (Sep 6, 2025).

So I am trying to make ollama conversational, as in talk to it. The main issue is the inability to pause the inference and resume it. To clarify: I need to pause the inference after each sentence, run the TTS, then resume for the next sentence, and repeat.

I tried parsing the output, splitting it, and running the TTS in parallel. It works, but ollama usually isn't really in sync with the TTS model, and that means a lot of inference gets thrown in the trash when I interrupt; in other words, it's inefficient. It would be a lot better if I could use the Python API to pause the inference and then call for it to resume when I need to.

Can you please see this through with me?

GiteaMirror added the feature request label 2025-11-12 14:31:25 -06:00

@rick-github commented on GitHub (Sep 6, 2025):

Let me re-state the issue to make sure I understand it.

You want to have a TTS system render the output of the model. You want to interrupt the model before all of its output has been rendered as speech. You are concerned that the model is generating tokens faster than they can be rendered as speech, and hence interrupting the model wastes tokens. You want to control the rate of token generation so that when you interrupt the model, few or no tokens are queued for speech rendering.


@Abdulrahman392011 commented on GitHub (Sep 6, 2025):

yeah, you kinda got the gist of it.

I can easily parse the output of the model, and when I hit a full-stop character I would use the feature I am asking for to pause the inference so that all the compute resources can be allocated to the TTS model. Once that sentence becomes audio, I would use the feature to resume the inference and generate the next sentence, and so on.

This would make it possible to run any model in a conversational way, so that when you interrupt the model verbally it's very smooth and not a toll on the compute resources.


@rick-github commented on GitHub (Sep 7, 2025):

There's currently no mechanism for pausing and resuming inference. The closest you can get is to stop the inference and then restart it. This frees up the GPU for the TTS, but the model stays loaded for when the inference is restarted. If you want to let the TTS model take over the VRAM as well, you can set `keep_alive=0` to evict the model. This frees up VRAM but has the downside that there will be a delay re-loading the model for the next phrase inference (about 0.6 seconds on a 4070), and the prompt cache is flushed.

```python
#!/usr/bin/env python3

import ollama
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="qwen2.5:0.5b")
parser.add_argument("prompt", nargs='*', default=["Why is the sky blue?"])
args = parser.parse_args()

#break_chars = [ '.', ',' ]
break_chars = [ '.' ]
#options = {"temperature":0,"seed":42}
options = {}
#keep_alive = 0   # evict the model between phrases, freeing VRAM
keep_alive = -1   # keep the model loaded between phrases

def speak(phrase):
  phrase = phrase.replace("\n", " ").lstrip(" ")
  print(f"speak('{phrase}')")

def phrases(prompt):
  messages = [{"role":"user","content":prompt}]
  assistant = ''
  phrase = ''
  while True:
    # Resume by replaying the partial assistant reply as a prefix.
    m = messages + ([{"role":"assistant","content":assistant}] if len(assistant) > 0 else [])
    response = ollama.chat(model=args.model, messages=m, options=options, stream=True, keep_alive=keep_alive)
    for r in response:
      c = r.message.content
      assistant += c
      phrase += c
      if r.done:
        yield phrase
        return
      dot = max([phrase.rfind(ch) for ch in break_chars])
      if dot > -1 and len(phrase) > dot+1 and phrase[dot+1] in [" ", "\n"]:
        response.close()   # stop inference at the sentence boundary
        yield phrase[:dot+1]
        phrase = phrase[dot+1:]

def main():
  prompt = " ".join(args.prompt)
  for phrase in phrases(prompt):
    speak(phrase)

if __name__ == "__main__":
  main()
```

In practice you wouldn't wait until `speak()` had finished playing the audio before restarting the inference to get the next phrase; you would use `asyncio` or multi-threading, but you get the idea.
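The overlap described above can be sketched with a plain thread and a bounded queue. This is only a minimal illustration: the token source is stubbed out (no ollama or TTS calls), and `phrases_stub` and `speak_all` are placeholder names, not part of any API.

```python
import queue
import threading

def phrases_stub():
    # Stand-in for the phrases() generator above; a real version
    # would yield sentence-sized chunks from the streaming chat.
    yield from ["The sky is blue. ", "This is due to Rayleigh scattering. "]

def speak_all(q, spoken):
    # Consumer: render each phrase while the producer keeps generating.
    while True:
        phrase = q.get()
        if phrase is None:                 # sentinel: no more phrases
            break
        spoken.append(phrase.strip())      # real code would call the TTS here

q = queue.Queue(maxsize=1)  # bounds how far inference runs ahead of the TTS
spoken = []
t = threading.Thread(target=speak_all, args=(q, spoken))
t.start()
for phrase in phrases_stub():   # producer: the inference loop
    q.put(phrase)               # blocks when the TTS falls behind
q.put(None)
t.join()
print(spoken)  # → ['The sky is blue.', 'This is due to Rayleigh scattering.']
```

The `maxsize=1` queue is the point of the sketch: inference can run at most one phrase ahead of the TTS, so interrupting wastes very little work.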


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

It's a feature request; keep it in mind if you come across a way to implement it in the runner itself. Theoretically it's possible, but in practice things aren't that simple. Talk with the rest of the team and keep it for when someone finds a way to integrate it in the runner itself.

Thanks for the great support, always fast to reply and understanding.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I just had another note: when I say compute resources, I don't mean the RAM, I mean the CPU and GPU. I am trying to decrease latency as much as possible to make the conversation with the model feel smooth and natural. Both the TTS model and the LLM can easily fit in VRAM or RAM; the issue is the CPU and GPU resources.


@rick-github commented on GitHub (Sep 7, 2025):

> I just had another note. when I say compute resources, I don't mean the ram, I mean the cpu and gpu

As explained in the solution, the ollama runner stops doing inference, hence stops using CPU and GPU.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I understand, but if you lose the inference reply then it defeats the point, because when you instantiate another reply it will start over. The point is to carry on, so there's no redundancy.

I am not 100% sure of this, but I think the Linux kernel has a feature that allows you to pause a process and resume it. You could look into which process runs the model inference, pause it when you receive the request to do so from the Python API, and then resume it when you receive another request from the user.

I will look into that feature and tell you what I find.


@rick-github commented on GitHub (Sep 7, 2025):

> cause when you instantiate another reply it will start over.

It does not. Did you even try running the example?


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

Sorry, you're right; I misread the code, I didn't run it. I will do so now. Sorry again.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I ran the code and it works beautifully. That's just what I needed. Thanks rick, you rock!


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

hey @rick-github, so I used your code and realized something very interesting: the same method can actually be used to switch between models on a per-token basis. Instead of resuming the generation on the same model, you could use a smaller or larger model depending on the confidence of the token. Assuming the temperature is set to zero, the small model might not know the answer to the question, so the exact token that contains the answer will have a lower confidence level. That can trigger the system to push the same conversation history, plus the tokens already generated up to right before the token that holds the answer, to a larger model that actually knows what the answer is going to be, then return to the small model when the confidence level is back within range.

The main downside is that both the large and small models have to fit in memory together, but on a server level that's pretty common.

Theoretically it would improve output quality and cut latency and electricity usage, not to mention that the same architecture could be extended beyond just a large and a small model: it could have many models that specialize in different things and are already available open source, so it would refer mathematical tokens, for instance, to a math model, and so on.

Can you share your thoughts about this? I think it will work.
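For what it's worth, the routing loop being proposed might look something like the sketch below. It only works under the assumption that per-token logprobs are available, which (as noted later in this thread) ollama does not expose today; both models are stubbed with hard-coded tables, and `small_model`, `large_model`, and `route` are hypothetical names, not any real API.

```python
import math

# Hypothetical per-token (token, logprob) lookups; real code would stream
# tokens with logprobs from an inference engine instead of these stubs.
def small_model(prefix):
    # Pretend the small model is unsure about the factual token ("???").
    table = {"": ("The", -0.1), "The": ("capital", -0.2),
             "The capital": ("is", -0.3), "The capital is": ("???", -5.0)}
    return table[prefix]

def large_model(prefix):
    return ("Paris", -0.05)  # the large model is confident

def route(threshold=math.log(0.5)):
    prefix, used_large = "", 0
    for _ in range(4):
        token, lp = small_model(prefix)
        if lp < threshold:                    # low confidence: defer
            token, lp = large_model(prefix)   # large model sees same prefix
            used_large += 1
        prefix = (prefix + " " + token).strip()
    return prefix, used_large

print(route())  # → ('The capital is Paris', 1)
```

The small model handles three of the four tokens; only the low-confidence one is deferred, which is the claimed efficiency win.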


@rick-github commented on GitHub (Sep 21, 2025):

> use a smaller model or a larger model depending on the confidence percentage of the token

This is similar to a technique called speculative decoding, see https://arxiv.org/pdf/2211.17192.

> the main downside to this that both the large and small models have to fit in memory together. but on a server level that's pretty common.

Only if you want to reduce latency, unloading a model and reloading will have no effect on token generation.

> can you share your thoughts about this. i think it will work.

It will work, there's no magic in context pre-fill. llama.cpp implements speculative decoding, although I haven't experimented with it to see what the performance/fidelity metrics are like. Google [uses it](https://research.google/blog/looking-back-at-speculative-decoding/) for some of their products.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

You're right, it's similar to speculative decoding, but I think it targets a different issue. Isn't speculative decoding about generating multiple tokens at the same time?

What I am suggesting here is not really about that; I am trying to switch models. Speculative decoding uses the same model for multiple tokens simultaneously, I think. Am I right?


@rick-github commented on GitHub (Sep 21, 2025):

Similar, not the same. Speculative decoding is implemented in different ways, llama.cpp uses multiple models.
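As a reference point, llama.cpp's server takes a draft model alongside the target model, along the lines of the fragment below. The model file names are placeholders, and the flag names (`-md`, `--draft-max`) are from recent llama.cpp builds and may vary by version.

```shell
# Speculative decoding in llama.cpp: the small draft model proposes
# tokens and the large target model verifies them in one pass.
./llama-server \
  -m models/large-target-model.gguf \
  -md models/small-draft-model.gguf \
  --draft-max 16
```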


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

so maybe the mistake they are making is the choice of models they are running.

If the models are specialized in different things, this would let people use models as modules for their system; in other words, customize their own system instead of relying on one company to make an architecture that suits their needs.

Not to mention that if you can switch between large and small models in the same response, it will enhance performance.

It would also create a market for really large models in server environments. No one can justify deploying a 2-trillion-parameter model to say hi and hello to customers, but if that model only generates the most crucial tokens, then suddenly it makes sense to keep it running in RAM.


@rick-github commented on GitHub (Sep 21, 2025):

> so maybe the mistake they are doing is the choice of the models they are running.

What mistake?

> but if the model only generate the most crucial tokens

I think this is the crux: determining the cruciality of tokens.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

By mistake I meant whatever mistakes were made that kept it from becoming popular and a must-have in every engine.

It's relatively easy to build such a system; it's not really all that difficult. All that's missing from ollama is the logprobs we were talking about earlier to measure the crux. The rest we can do in Python and provide some templates for different implementations with different models.


@rick-github commented on GitHub (Sep 21, 2025):

> by mistake i meant the mistake that were done that made not really popular and a must have in every engine.

Speculative decoding is available in llama.cpp, LM Studio, and vLLM, which covers a fair chunk of the open inference engine market. It's not widely used because it's a bit niche and not well understood.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

It's not well understood, that is absolutely true. They market it as a way of getting better token generation speed, which isn't really its biggest impact. The biggest impact, in my opinion, is the ability to construct your AI system from multiple models and have them work in harmony to give the best response in the most efficient way possible.

Every year RAM gets cheaper and unified memory becomes more of a standard among smaller devices. We will reach a point where it's okay to leave a half-trillion-parameter model in RAM only to generate the crucial tokens, and even on a local machine that isn't all that performant, the wait for the crucial token to be generated right becomes justifiable.

The thing is, I get things wrong a lot, but I think what I am asking for here, even if similar to speculative decoding in technique, is different in spirit.


@rick-github commented on GitHub (Sep 21, 2025):

It's sufficiently different that I foresee implementation and performance issues, but it is an interesting application. I look forward to the implementation when logprobs are available in ollama. In the meantime, you can experiment with the engines that already offer logprobs.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

By the way rick, why did ollama move away from llama.cpp? Are there any limitations I should be aware of while experimenting?


@rick-github commented on GitHub (Sep 21, 2025):

At the time, development on llama.cpp was languishing: they had basically shelved the idea of supporting multi-modal models just when vision processing was starting to take off. So the ollama team decided to develop their own vision processing and other capabilities, causing the server codebases to diverge. ollama still uses the llama.cpp GPU kernels, so the two projects mainly differ at the server level. llama.cpp has since revitalized its development and is still used in ollama for those models that haven't been integrated into the new ollama engine.

As far as I know this shouldn't pose any problems for your experiments.

Reference: github-starred/ollama-ollama#8127