pausing and resuming inference #8127

Closed
opened 2025-11-12 14:31:25 -06:00 by GiteaMirror · 22 comments
Owner

Originally created by @Abdulrahman392011 on GitHub (Sep 6, 2025).

So I am trying to make ollama conversational, as in talk to it. The main issue is the inability to pause the inference and resume it. To clarify: I need to pause the inference after each sentence, run the TTS, then resume for the next sentence, and repeat.

I tried parsing the output, splitting it, and running the TTS in parallel. It works, but ollama usually isn't really in sync with the TTS model, and that means a lot of inference gets thrown in the trash when I interrupt; in other words, it's inefficient. It would be a lot better if I could use the Python API to pause the inference and then call for it to resume when I need to.

Can you please see this through with me?

GiteaMirror added the feature request label 2025-11-12 14:31:25 -06:00

@rick-github commented on GitHub (Sep 6, 2025):

Let me re-state the issue to make sure I understand it.

You want to have a TTS system render the output of the model. You want to interrupt the model before all of its output has been rendered as speech. You are concerned that the model is generating tokens faster than they can be rendered as speech, and hence interrupting the model wastes tokens. You want to control the rate of token generation so that when you interrupt the model, few or no tokens are queued for speech rendering.


@Abdulrahman392011 commented on GitHub (Sep 6, 2025):

yeah, you kinda got the gist of it.

I can easily parse the output of the model, and when I hit a full-stop character I would use the feature I am asking for to pause the inference so that all the compute resources can be allocated to the TTS model. Once that sentence becomes audio, I would use the feature to resume the inference and generate the next sentence, and so on.

This would make it possible to run any model in a conversational way, so that when you interrupt the model verbally it's very smooth and not a toll on the compute resources.


@rick-github commented on GitHub (Sep 7, 2025):

There's currently no mechanism for pausing and resuming inference. The closest you can get is to stop the inference and then restart it. This frees up the GPU for the TTS, but the model stays loaded for when the inference is restarted. If you want to let the TTS model take over the VRAM as well, you can set `keep_alive=0` to evict the model. This frees up VRAM but has the downside that there will be a delay re-loading the model for the next phrase inference (about 0.6 seconds on a 4070), and the prompt cache is flushed.

```python
#!/usr/bin/env python3

import ollama
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="qwen2.5:0.5b")
parser.add_argument("prompt", nargs='*', default=["Why is the sky blue?"])
args = parser.parse_args()

#break_chars = [ '.', ',' ]
break_chars = [ '.' ]
#options = {"temperature":0,"seed":42}
options = {}
#keep_alive = 0   # evict the model between phrases, freeing VRAM
keep_alive = -1   # keep the model loaded between phrases

def speak(phrase):
  phrase = phrase.replace("\n", " ").lstrip(" ")
  print(f"speak('{phrase}')")

def phrases(prompt):
  messages = [{"role":"user","content":prompt}]
  assistant = ''
  phrase = ''
  while True:
    # Resume by replaying the partial assistant reply as a prefix.
    m = messages + ([{"role":"assistant","content":assistant}] if len(assistant) > 0 else [])
    response = ollama.chat(model=args.model, messages=m, options=options, stream=True, keep_alive=keep_alive)
    for r in response:
      c = r.message.content
      assistant += c
      phrase += c
      if r.done:
        yield phrase
        return
      dot = max([phrase.rfind(ch) for ch in break_chars])
      if dot > -1 and len(phrase) > dot+1 and phrase[dot+1] in [" ", "\n"]:
        response.close()   # stop inference at the sentence boundary
        yield phrase[:dot+1]
        phrase = phrase[dot+1:]

def main():
  prompt = " ".join(args.prompt)
  for phrase in phrases(prompt):
    speak(phrase)

if __name__ == "__main__":
  main()
```

In practice you wouldn't wait until `speak()` had finished playing the audio before restarting the inference to get the next phrase; you would use `asyncio` or multi-threading, but you get the idea.
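The overlap described above can be sketched with a plain thread and a bounded queue. This is only a minimal illustration: the token source is stubbed out (no ollama or TTS calls), and `phrases_stub` and `speak_all` are placeholder names, not part of any API.

```python
import queue
import threading

def phrases_stub():
    # Stand-in for the phrases() generator above; a real version
    # would yield sentence-sized chunks from the streaming chat.
    yield from ["The sky is blue. ", "This is due to Rayleigh scattering. "]

def speak_all(q, spoken):
    # Consumer: render each phrase while the producer keeps generating.
    while True:
        phrase = q.get()
        if phrase is None:                 # sentinel: no more phrases
            break
        spoken.append(phrase.strip())      # real code would call the TTS here

q = queue.Queue(maxsize=1)  # bounds how far inference runs ahead of the TTS
spoken = []
t = threading.Thread(target=speak_all, args=(q, spoken))
t.start()
for phrase in phrases_stub():   # producer: the inference loop
    q.put(phrase)               # blocks when the TTS falls behind
q.put(None)
t.join()
print(spoken)  # → ['The sky is blue.', 'This is due to Rayleigh scattering.']
```

The `maxsize=1` queue is the point of the sketch: inference can run at most one phrase ahead of the TTS, so interrupting wastes very little work.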


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

It's a feature request; keep it in mind if you come across a way to implement it in the runner itself. Theoretically it's possible, but in practice things aren't that simple. Talk with the rest of the team and keep it for when someone finds a way to integrate it in the runner itself.

Thanks for the great support, always fast to reply and understanding.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I just had another note: when I say compute resources, I don't mean the RAM, I mean the CPU and GPU. I am trying to decrease latency as much as possible to make the conversation with the model feel smooth and natural. Both the TTS model and the LLM can easily fit in VRAM or RAM; the issue is the CPU and GPU resources.


@rick-github commented on GitHub (Sep 7, 2025):

> I just had another note. when I say compute resources, I don't mean the ram, I mean the cpu and gpu

As explained in the solution, the ollama runner stops doing inference, hence stops using CPU and GPU.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I understand, but if you lose the inference reply then it defeats the point, because when you instantiate another reply it will start over. The point is to carry on, so there's no redundancy.

I am not 100% sure of this, but I think the Linux kernel has a feature that allows you to pause a process and resume it. You could look into which process runs the model inference, pause it when you receive the request to do so from the Python API, and then resume it when you receive another request from the user.

I will look into that feature and tell you what I find.


@rick-github commented on GitHub (Sep 7, 2025):

> cause when you instantiate another reply it will start over.

It does not. Did you even try running the example?


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

Sorry, you're right; I misread the code, I didn't run it. I will do so now. Sorry again.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I ran the code and it works beautifully. That's just what I needed. Thanks rick, you rock!


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

hey @rick-github, so I used your code and realized something very interesting: the same method can actually be used to switch between models on a per-token basis. Instead of resuming the generation on the same model, you could use a smaller or larger model depending on the confidence of the token. Assuming the temperature is set to zero, the small model might not know the answer to the question, so the exact token that contains the answer will have a lower confidence level. That can trigger the system to push the same conversation history, plus the tokens already generated up to right before the token that holds the answer, to a larger model that actually knows what the answer is going to be, then return to the small model when the confidence level is back within range.

The main downside is that both the large and small models have to fit in memory together, but on a server level that's pretty common.

Theoretically it would improve output quality and cut latency and electricity usage, not to mention that the same architecture could be extended beyond just a large and a small model: it could have many models that specialize in different things and are already available open source, so it would refer mathematical tokens, for instance, to a math model, and so on.

Can you share your thoughts about this? I think it will work.
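For what it's worth, the routing loop being proposed might look something like the sketch below. It only works under the assumption that per-token logprobs are available, which (as noted later in this thread) ollama does not expose today; both models are stubbed with hard-coded tables, and `small_model`, `large_model`, and `route` are hypothetical names, not any real API.

```python
import math

# Hypothetical per-token (token, logprob) lookups; real code would stream
# tokens with logprobs from an inference engine instead of these stubs.
def small_model(prefix):
    # Pretend the small model is unsure about the factual token ("???").
    table = {"": ("The", -0.1), "The": ("capital", -0.2),
             "The capital": ("is", -0.3), "The capital is": ("???", -5.0)}
    return table[prefix]

def large_model(prefix):
    return ("Paris", -0.05)  # the large model is confident

def route(threshold=math.log(0.5)):
    prefix, used_large = "", 0
    for _ in range(4):
        token, lp = small_model(prefix)
        if lp < threshold:                    # low confidence: defer
            token, lp = large_model(prefix)   # large model sees same prefix
            used_large += 1
        prefix = (prefix + " " + token).strip()
    return prefix, used_large

print(route())  # → ('The capital is Paris', 1)
```

The small model handles three of the four tokens; only the low-confidence one is deferred, which is the claimed efficiency win.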


@rick-github commented on GitHub (Sep 21, 2025):

> use a smaller model or a larger model depending on the confidence percentage of the token

This is similar to a technique called speculative decoding, see https://arxiv.org/pdf/2211.17192.

> the main downside to this that both the large and small models have to fit in memory together. but on a server level that's pretty common.

Only if you want to reduce latency, unloading a model and reloading will have no effect on token generation.

> can you share your thoughts about this. i think it will work.

It will work, there's no magic in context pre-fill. llama.cpp implements speculative decoding, although I haven't experimented with it to see what the performance/fidelity metrics are like. Google [uses it](https://research.google/blog/looking-back-at-speculative-decoding/) for some of their products.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

You're right, it's similar to speculative decoding, but I think it targets a different issue. Isn't speculative decoding about generating multiple tokens at the same time?

What I am suggesting here is not really about that; I am trying to switch models. Speculative decoding uses the same model for multiple tokens simultaneously, I think. Am I right?


@rick-github commented on GitHub (Sep 21, 2025):

Similar, not the same. Speculative decoding is implemented in different ways, llama.cpp uses multiple models.
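As a reference point, llama.cpp's server takes a draft model alongside the target model, along the lines of the fragment below. The model file names are placeholders, and the flag names (`-md`, `--draft-max`) are from recent llama.cpp builds and may vary by version.

```shell
# Speculative decoding in llama.cpp: the small draft model proposes
# tokens and the large target model verifies them in one pass.
./llama-server \
  -m models/large-target-model.gguf \
  -md models/small-draft-model.gguf \
  --draft-max 16
```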


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

so maybe the mistake they are making is the choice of models they are running.

If the models are specialized in different things, this would let people use models as modules for their system; in other words, customize their own system instead of relying on one company to make an architecture that suits their needs.

Not to mention that if you can switch between large and small models in the same response, it will enhance performance.

It would also create a market for really large models in server environments. No one can justify deploying a 2-trillion-parameter model to say hi and hello to customers, but if that model only generates the most crucial tokens, then suddenly it makes sense to keep it running in RAM.


@rick-github commented on GitHub (Sep 21, 2025):

> so maybe the mistake they are doing is the choice of the models they are running.

What mistake?

> but if the model only generate the most crucial tokens

I think this is the crux: determining the cruciality of tokens.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

By mistake I meant whatever mistakes were made that kept it from becoming popular and a must-have in every engine.

It's relatively easy to build such a system; it's not really all that difficult. All that's missing from ollama is the logprobs we were talking about earlier to measure the crux. The rest we can do in Python and provide some templates for different implementations with different models.


@rick-github commented on GitHub (Sep 21, 2025):

> by mistake i meant the mistake that were done that made not really popular and a must have in every engine.

Speculative decoding is available in llama.cpp, LM Studio, and vLLM, which covers a fair chunk of the open inference engine market. It's not widely used because it's a bit niche and not well understood.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

It's not well understood, that is absolutely true. They market it as a way of getting better token generation speed, which isn't really its biggest impact. The biggest impact, in my opinion, is the ability to construct your AI system from multiple models and have them work in harmony to give the best response in the most efficient way possible.

Every year RAM gets cheaper and unified memory becomes more of a standard among smaller devices. We will reach a point where it's okay to leave a half-trillion-parameter model in RAM only to generate the crucial tokens, and even on a local machine that isn't all that performant, the wait for the crucial token to be generated right becomes justifiable.

The thing is, I get things wrong a lot, but I think what I am asking for here, even if similar to speculative decoding in technique, is different in spirit.


@rick-github commented on GitHub (Sep 21, 2025):

It's sufficiently different that I foresee implementation and performance issues, but it is an interesting application. I look forward to the implementation when logprobs are available in ollama. In the meantime, you can experiment with the engines that already offer logprobs.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

By the way rick, why did ollama move away from llama.cpp? Are there any limitations I should be aware of while experimenting?


@rick-github commented on GitHub (Sep 21, 2025):

At the time, development on llama.cpp was languishing: they had basically shelved the idea of supporting multi-modal models just when vision processing was starting to take off. So the ollama team decided to develop their own vision processing and other capabilities, causing the server codebases to diverge. ollama still uses the llama.cpp GPU kernels, so the two projects mainly differ at the server level. llama.cpp has since revitalized its development and is still used in ollama for those models that haven't been integrated into the new ollama engine.

As far as I know this shouldn't pose any problems for your experiments.

Reference: github-starred/ollama-ollama#8127