[GH-ISSUE #12204] pausing and resuming inference #33879

Closed
opened 2026-04-22 17:02:15 -05:00 by GiteaMirror · 33 comments

Originally created by @Abdulrahman392011 on GitHub (Sep 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12204

So I am trying to make ollama conversational, as in talking to it. The main issue is the inability to pause inference and resume it. To clarify: I need to pause inference after each sentence, run the TTS, then resume for the next sentence, and repeat.

I tried parsing the output, splitting it, and running the TTS in parallel. It works, but ollama usually isn't really in sync with the TTS model, and that means a lot of inference happens that gets thrown in the trash when I interrupt; in other words, it's inefficient. It would be a lot better if I could use the Python API to pause the inference and then call for it to resume when I need to.

Can you please see this through with me?

GiteaMirror added the feature request label 2026-04-22 17:02:15 -05:00

@rick-github commented on GitHub (Sep 6, 2025):

Let me re-state the issue to make sure I understand it.

You want to have a TTS system render the output of the model. You want to interrupt the model before all of its output has been rendered as speech. You are concerned that the model is generating tokens faster than they can be rendered as speech, and hence interrupting the model wastes tokens. You want to control the rate of token generation so that when you interrupt the model, few or no tokens are queued for speech rendering.


@Abdulrahman392011 commented on GitHub (Sep 6, 2025):

yeah you kinda got the gist of it.

so I can easily parse the output of the model and when i hit a full stop character I will use the feature I am asking for to pause the inference so that all the compute resources can be allocated to the TTS model. once that sentence becomes audio. I will use the feature I am asking for to resume the inference of the model to generate the next sentence, and so on.

this will be able to run any model in a conversational way. so when you interrupt the model verbally it's very smooth and not a toll on the compute resources.


@rick-github commented on GitHub (Sep 7, 2025):

There's currently no mechanism for pausing and resuming inference. The closest you can get is to stop the inference and then restart it. This frees up the GPU for the TTS, but the model stays loaded for when the inference is restarted. If you want to let the TTS model take over the VRAM as well, you can set `keep_alive=0` to evict the model. This frees up VRAM but has the downside that there will be a delay re-loading the model for the next phrase inference (about 0.6 seconds on a 4070), and the prompt cache is flushed.

```python
#!/usr/bin/env python3

import ollama
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="qwen2.5:0.5b")
parser.add_argument("prompt", nargs='*', default=["Why is the sky blue?"])
args = parser.parse_args()

#break_chars = [ '.', ',' ]
break_chars = [ '.' ]
#options = {"temperature":0,"seed":42}
options = {}
#keep_alive = 0    # evict the model between phrases to free VRAM
keep_alive = -1    # keep the model loaded between phrases

def speak(phrase):
  # Stand-in for the TTS step: just print the phrase that would be spoken.
  phrase = phrase.replace("\n", " ").lstrip(" ")
  print(f"speak('{phrase}')")

def phrases(prompt):
  messages = [{"role":"user","content":prompt}]
  assistant = ''   # everything generated so far, used to resume where we stopped
  phrase = ''
  while True:
    # Resume by passing the partial assistant message back as a prefix.
    m = messages + ([{"role":"assistant","content":assistant}] if len(assistant) > 0 else [])
    response = ollama.chat(model=args.model, messages=m, options=options, stream=True, keep_alive=keep_alive)
    for r in response:
      c = r.message.content
      assistant += c
      phrase += c
      if r.done:
        yield phrase
        return
      # Find the last break character that is followed by whitespace.
      dot = max([phrase.rfind(ch) for ch in break_chars])
      if dot > -1 and len(phrase) > dot+1 and phrase[dot+1] in [" ", "\n"]:
        response.close()   # stop the inference; the GPU is free until the next call
        yield phrase[:dot+1]
        phrase = phrase[dot+1:]

def main():
  prompt = " ".join(args.prompt)
  for phrase in phrases(prompt):
    speak(phrase)
  #print(ollama.chat(model=args.model, messages=[{"role":"user","content":prompt}], options=options, stream=False).message.content)

if __name__ == "__main__":
  main()
```

In practice you wouldn't wait until `speak()` had finished playing the audio before restarting the inference to get the next phrase; you would use `asyncio` or multiple threads, but you get the idea.
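For illustration, a minimal sketch of that overlap, assuming the `phrases()` generator and `speak()` stub above; the thread-plus-queue structure and the queue size of 1 are arbitrary choices so that generation stays only one phrase ahead of playback:

```python
# Sketch only: overlap audio playback with generation of the next phrase.
# Assumes phrases() and speak() from the example above.
import queue
import threading

def pipeline(prompt):
  q = queue.Queue(maxsize=1)            # at most one phrase generated ahead of playback

  def producer():
    for phrase in phrases(prompt):      # generation stops between phrases (response.close())
      q.put(phrase)                     # blocks while the previous phrase is still queued
    q.put(None)                         # sentinel: no more phrases

  threading.Thread(target=producer, daemon=True).start()
  while (phrase := q.get()) is not None:
    speak(phrase)                       # next phrase is prepared while this one plays
```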


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

It's a feature request; keep it in mind if you come across a way to implement it in the runner itself. Theoretically it's possible, but in practice things aren't that simple. Talk with the rest of the team and keep it for when someone finds a way to integrate it in the runner itself.

Thanks for the great support, always fast to reply and understanding.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I just had another note: when I say compute resources, I don't mean the RAM, I mean the CPU and GPU. I am trying to decrease the latency as much as possible to make the conversation with the model feel smooth and natural. Both the TTS model and the LLM can easily fit in VRAM or RAM; the issue is the CPU and GPU resources.


@rick-github commented on GitHub (Sep 7, 2025):

> I just had another note: when I say compute resources, I don't mean the RAM, I mean the CPU and GPU.

As explained in the solution, the ollama runner stops doing inference, hence stops using CPU and GPU.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I understand, but if you lose the inference reply then it defeats the point, because when you instantiate another reply it will start over. The point is to carry on so there is no redundancy.

I am not 100% sure of this, but I think the Linux kernel has a feature that allows you to pause a process and resume it. You could look into which process is running the model inference, pause it when you receive the request to do so from the Python API, and then resume it again when you receive another request from the user.

I will look into that feature and tell you what I find.
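For what it's worth, the kernel feature in question is job-control signals: SIGSTOP suspends a process and SIGCONT resumes it. A minimal illustration follows; the PID is a hypothetical placeholder, and note that stopping the runner this way keeps VRAM allocated and can leave the in-flight HTTP request hanging, which is part of why the stop-and-restart approach above is more practical:

```python
# Illustration only: SIGSTOP/SIGCONT pause and resume a process at the kernel level.
# The runner PID is a hypothetical placeholder; this frees CPU/GPU cycles but not
# VRAM, and the paused HTTP request may time out, so it is not a real solution.
import os
import signal
import time

runner_pid = 12345                      # hypothetical PID of the inference process

os.kill(runner_pid, signal.SIGSTOP)     # suspend: the process stops getting CPU time
time.sleep(2)                           # ... run the TTS here ...
os.kill(runner_pid, signal.SIGCONT)     # resume where it left off
```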


@rick-github commented on GitHub (Sep 7, 2025):

> because when you instantiate another reply it will start over.

It does not. Did you even try running the example?


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

Sorry, you're right. I misread the code; I didn't run it. I will do so now. Sorry again.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I ran the code and it works beautifully. That's just what I needed. Thanks, Rick, you rock.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

Hey @rick-github, so I used your code and realized something very interesting: the same method can actually be used to switch between models on a per-token basis. Instead of resuming the generation on the same model, you can use a smaller or a larger model depending on the confidence of the token. Assuming the temperature is set to zero, the small model might not know the answer to the question, so the exact token that contains the answer will have a lower confidence level. That can trigger the system to push the same conversation history, plus the tokens already generated up to right before the token that holds the answer, to a larger model that actually knows what the answer is going to be, then return to the small model when the confidence level is back within range.

The main downside to this is that both the large and small models have to fit in memory together, but at server level that's pretty common.

Theoretically speaking it would improve the quality of the outputs and cut latency and electricity usage, not to mention that the same architecture can be extended beyond just a large and a small model. It could have many models that specialize in different things and are already available and open-sourced, so it would refer mathematical tokens, for instance, to a math model, and so on.

Can you share your thoughts about this? I think it will work.
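A rough sketch of what that routing could look like, with heavy caveats: `get_confidence()` is a pure placeholder (ollama exposed no token probabilities at the time), and the model names, the threshold, and the 8-token escalation length are all made up for illustration:

```python
# Hypothetical sketch of per-token model switching based on confidence.
# get_confidence() is a placeholder; the models and threshold are illustrative only.
import ollama

SMALL = "qwen2.5:0.5b"    # assumed small model
LARGE = "qwen2.5:7b"      # assumed large model
THRESHOLD = 0.5           # assumed confidence cutoff

def get_confidence(chunk):
  # Placeholder: a real implementation would read the token's (log)probability.
  return 1.0

def generate(prompt):
  messages = [{"role": "user", "content": prompt}]
  assistant = ""
  while True:
    m = messages + ([{"role": "assistant", "content": assistant}] if assistant else [])
    response = ollama.chat(model=SMALL, messages=m, stream=True)
    for r in response:
      c = r.message.content
      if not r.done and get_confidence(r) < THRESHOLD:
        response.close()                       # stop the small model before the uncertain token
        m = messages + [{"role": "assistant", "content": assistant}]
        big = ollama.chat(model=LARGE, messages=m, options={"num_predict": 8})
        assistant += big.message.content       # let the large model write the hard part
        break                                  # then resume the outer loop with the small model
      assistant += c
      if r.done:
        return assistant
```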


@rick-github commented on GitHub (Sep 21, 2025):

> use a smaller or a larger model depending on the confidence of the token

This is similar to a technique called speculative decoding, see https://arxiv.org/pdf/2211.17192.

> The main downside to this is that both the large and small models have to fit in memory together, but at server level that's pretty common.

Only if you want to reduce latency; unloading a model and reloading it will have no effect on token generation.

> Can you share your thoughts about this? I think it will work.

It will work; there's no magic in context pre-fill. llama.cpp implements speculative decoding, although I haven't experimented with it to see what the performance/fidelity metrics are like. Google [uses it](https://research.google/blog/looking-back-at-speculative-decoding/) for some of their products.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

You're right, it's similar to speculative decoding, but I think it targets a different issue. Isn't speculative decoding about generating multiple tokens at the same time?

What I am suggesting here is not really about that; I am trying to switch models. Speculative decoding uses the same model for multiple tokens simultaneously, I think. Am I right?


@rick-github commented on GitHub (Sep 21, 2025):

Similar, not the same. Speculative decoding is implemented in different ways; llama.cpp uses multiple models.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

So maybe the mistake they are making is the choice of models they are running.

If the models are specialized in different things, this will allow people to use the models as modules for their system; in other words, to customize their own system instead of relying on one company to make an architecture that suits their needs.

Not to mention that if you can switch between large and small models in the same response, it will enhance the performance.

It will also make a market for really large models in server environments. No one can actually deploy a 2-trillion-parameter model just to say hi and hello to customers, but if that model only generates the most crucial tokens then suddenly it makes sense to keep it running in RAM.


@rick-github commented on GitHub (Sep 21, 2025):

> So maybe the mistake they are making is the choice of models they are running.

What mistake?

> but if that model only generates the most crucial tokens

I think this is the crux, determining the cruciality of tokens.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

By 'mistake' I meant whatever was done that kept it from becoming popular and a must-have in every engine.

It's relatively easy to make such a system; it's not really all that difficult. All that's missing from ollama is the logprobs we were talking about earlier to get the crux. The rest we can do in Python and provide some templates for different implementations with different models.


@rick-github commented on GitHub (Sep 21, 2025):

> By 'mistake' I meant whatever was done that kept it from becoming popular and a must-have in every engine.

Speculative decoding is available in llama.cpp, LMStudio and vLLM, which covers a fair chunk of the open inference engine market. It's not widely used because it's a bit niche and not well understood.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

It's not well understood, that is absolutely true. They market it as a way of getting better token generation speed, which isn't really the biggest impact of the technique. The biggest impact, in my opinion, is the ability to construct your AI system with multiple models and have them work in harmony to give the best response in the most efficient way possible.

Every year RAM is getting cheaper and unified memory is becoming more of a standard among smaller devices. We will reach a point where it's OK to keep a half-trillion-parameter model in RAM only to generate the crucial tokens, and even on a local machine that isn't really that performant, the wait for the crucial token to be generated correctly becomes justifiable.

The thing is, I get things wrong a lot, but I think that what I am asking for here, even if similar to speculative decoding in technique, is different in spirit.


@rick-github commented on GitHub (Sep 21, 2025):

It's sufficiently different that I foresee implementation and performance issues, but it is an interesting application. I look forward to the implementation when logprobs are available in ollama. In the meantime, you can experiment with the engines that already offer logprobs.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

By the way, Rick, why did ollama move away from llama.cpp? Are there any limitations I should be aware of while experimenting?


@rick-github commented on GitHub (Sep 21, 2025):

At the time, development on llama.cpp was languishing. They had basically shelved the idea of supporting multi-modal models, just when vision processing was starting to take off. So the ollama team decided to develop their own vision processing and other capabilities, causing the server codebase to diverge. ollama still uses the GPU kernels, so the two projects mainly differ at the server level. llama.cpp has since revitalized their development and is still used in ollama for those models that haven't been integrated into the new ollama engine.

As far as I know this shouldn't pose any problems to your experiments.


@xxxajk commented on GitHub (Apr 4, 2026):

What I would actually love to see is a way to send the LLM an interrupt, so it can do a self health check by providing various statistics, such as KV cache remaining, delta since the last check, and token budget left, in order to trigger a tool call that could save the current progress of its thinking and then resume with a fresh context. I have been playing with a similar cloning concept, which has shown success.
Basically, when responses start to take too long or the history is getting too full, the LLM is instructed to provide a highly detailed summary of context, plus certain tool states/explicit rules to ignore (one of them is a goal-and-subgoal tree, which is used as persistent storage and which may have been asked not to be used, for example).
The summary is then sent to a clone as the first prompt, which trims off the history bloat from tool calls and chat, as a speed optimization.
If there were a way to trigger the tool call that performs such a summary, it wouldn't have to be checked before or after a tool call or at the CLI prompt; it could happen automatically instead. Thoughts?
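If I follow, a bare-bones sketch of that summarize-and-hand-off pattern might look something like the following; the trigger condition, model name, and summary instruction are all assumptions for illustration, not the actual setup described above:

```python
# Illustrative sketch of the summarize-and-clone pattern described above.
# The threshold and the summary instruction are assumptions, not the original setup.
import ollama

MODEL = "qwen2.5:0.5b"         # example model
MAX_MESSAGES = 40              # assumed trigger for compaction

SUMMARIZE = ("Summarize this conversation in detail: open goals and subgoals, "
             "tool states, and any explicit rules that must keep being followed.")

def chat_with_compaction(history, user_msg):
  history.append({"role": "user", "content": user_msg})
  if len(history) > MAX_MESSAGES:
    # Ask the current context to summarize itself ...
    summary = ollama.chat(model=MODEL,
                          messages=history + [{"role": "user", "content": SUMMARIZE}])
    # ... then start a fresh "clone" context seeded with that summary.
    history = [{"role": "system", "content": summary.message.content},
               {"role": "user", "content": user_msg}]
  reply = ollama.chat(model=MODEL, messages=history)
  history.append({"role": "assistant", "content": reply.message.content})
  return history, reply.message.content
```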


@Abdulrahman392011 commented on GitHub (Apr 4, 2026):

@xxxajk I'm sorry, I couldn't keep up with what you said. Can you go into a bit more detail and simplify? So far we can add the history as an assistant message to make the model continue where it stopped.

Are you worried about the context window for research purposes? Iterating over and over is one way to do things; I think that is what open-claw does. Ollama supports it, so you can use it for that purpose.

If you are worried about adding checkpoints to assess the progress before using a tool, I think that can be done with open-claw as well.


@xxxajk commented on GitHub (Apr 7, 2026):

The idea has uses beyond what I stated.
I realize an LLM is supposed to process text, but there are other interactions that can happen where you would want to send a trigger as an event signal.
If you've ever written low-level code that deals with hardware, you would understand the interrupt concept better.


@Abdulrahman392011 commented on GitHub (Apr 7, 2026):

@xxxajk Maybe use the logprobs feature they added recently; it can provide stats that might be useful for your use case.


@xxxajk commented on GitHub (Apr 7, 2026):

Well, that would be only one use case. I'm looking for something more along the lines of an interrupt, which is exactly what it sounds like: you interrupt the model, save state, do something else (with the model or resources), then return to execution where it left off. Think of when a model decides to use a tool, but the other way around: the model stops to use the tool, then resumes with the tool results.

As an analogy, let's say you are writing code, and mom taps you on the shoulder to get your attention to open a bottle that she can't. She interrupted you. You open the bottle, give it back to her, then continue writing code where you left off.

High level "applications" coders usually don't "understand" the concept these days, and it is very fundamental.


@Abdulrahman392011 commented on GitHub (Apr 8, 2026):

@xxxajk
I think I understand. If the stats you need are provided by the logprobs feature in ollama, you can save them in a JSON file and update the JSON every time the model is interrupted.

I'm not sure, but I think there's no framework that allows the model to be paused and resumed at a low level. There's always a workaround that unfortunately fails in some use cases.

I was trying to use the same pause-and-resume mechanism with structured output in order to have the model skip generating the object names. I was pausing the model, injecting the object name, and resuming. The problem was that when the model resumes it still needs an object name, and since the name was injected rather than generated by the model, it gets confused and treats it as part of a normal response rather than as the object name.

I agree with you that pausing and resuming at a lower level could be superior to the injection mechanism.
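For context, the injection attempt described above looks roughly like this; the prompt and the injected JSON prefix are invented for illustration, and this is the approach that confused the model rather than a recommended fix:

```python
# Rough illustration of the "inject the key, then resume" attempt described above.
# The prompt and the injected prefix are made up for the example.
import ollama

MODEL = "qwen2.5:0.5b"
prompt = "Describe this product as JSON with a 'name' field."
injected = '{"name": '        # prefix the model never generated itself

resumed = ollama.chat(
  model=MODEL,
  messages=[{"role": "user", "content": prompt},
            {"role": "assistant", "content": injected}],   # resume from the injected prefix
)
print(injected + resumed.message.content)
```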


@Abdulrahman392011 commented on GitHub (Apr 8, 2026):

@xxxajk why don't you take a look at the ollama runner code on GitHub and see if there's something that can be done? If you manage to pause and resume successfully, please make a pull request so we can all use it.


@xxxajk commented on GitHub (Apr 8, 2026):

I would, except for the following:
1: The whole architecture isn't built with enough abstraction; there's no way to even apply a hook.
2: The whole system pipeline (including models) has what I call the HAL9000 problem: reach the goal no matter what, lie if you need to, from start to completion, with no way to even request a graceful cancel.
3: It crams the entire model into VRAM, instead of loading as needed with the ability to evict old parts.
4: No concept of a session as an object context.
5: It wasn't designed with any sort of scheduling or cooperation.

Yes, some of these things would "slow the model down", but some of us don't care how long things take; the result is more important.
We get that people are impatient and want things fast, slim, and pretty, but there's not even an option available for corner cases.
I personally put functionality first, over speed, and optimize afterwards, so traps like this don't get baked in at the start of the design.
It would take too long for a single person to do this, especially when they aren't familiar with the code base.


@Abdulrahman392011 commented on GitHub (Apr 8, 2026):

Ollama's team has done a great job of simplifying the process of running a local model, which has allowed people like me to get closer to the field. However, you're right that the compromises made along the way are problematic. But stay cheerful: at least they got to this point self-funding the whole way. The problems can be tackled as they present themselves, like the pausing and resuming we're dealing with here. When you have some free time, take a look at their code; maybe it's easy to implement.


@xxxajk commented on GitHub (Apr 10, 2026):

> Ollama's team has done a great job of simplifying the process of running a local model, which has allowed people like me to get closer to the field. However, you're right that the compromises made along the way are problematic. But stay cheerful: at least they got to this point self-funding the whole way. The problems can be tackled as they present themselves, like the pausing and resuming we're dealing with here. When you have some free time, take a look at their code; maybe it's easy to implement.

Unfortunately I'm fairly overbooked, and I spend any spare time I have exploring and learning, just like you do.
I don't always look for the easiest way to run something either.
Again, the end results matter more than the journey to get there; perhaps you have the time, I do not. If you don't have the skills, acquire them. Everybody has the right to learn; nobody has the right to interfere. Do the legwork to reach the goal; it can be rewarding in many ways -- that's how I ended up overloaded with work and paying jobs. And you get to show your work.


@Abdulrahman392011 commented on GitHub (Apr 10, 2026):

@xxxajk
You're right, I'll try to learn more about this. I'm sure something good will come out of it, even if I can't reach the target per se.
