[GH-ISSUE #12204] pausing and resuming inference #33879

Closed
opened 2026-04-22 17:02:15 -05:00 by GiteaMirror · 33 comments

Originally created by @Abdulrahman392011 on GitHub (Sep 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12204

So I am trying to make ollama conversational, as in talking to it. The main issue is the inability to pause inference and resume it. To clarify: I need to pause inference after each sentence, run the TTS, then resume for the next sentence, and repeat.

I tried parsing the output, splitting it, and running the TTS in parallel. It works, but ollama usually isn't really in sync with the TTS model, and that means a lot of inference happens that gets thrown in the trash when I interrupt; in other words, it's inefficient. It would be a lot better if I could use the Python API to pause the inference and then call for it to resume when I need to.

Can you please see this through with me?

GiteaMirror added the feature request label 2026-04-22 17:02:15 -05:00

@rick-github commented on GitHub (Sep 6, 2025):

Let me re-state the issue to make sure I understand it.

You want to have a TTS system render the output of the model. You want to interrupt the model before all of its output has been rendered as speech. You are concerned that the model is generating tokens faster than they can be rendered as speech, and hence interrupting the model wastes tokens. You want to control the rate of token generation so that when you interrupt the model, few or no tokens are queued for speech rendering.


@Abdulrahman392011 commented on GitHub (Sep 6, 2025):

yeah you kinda got the gist of it.

so I can easily parse the output of the model and when i hit a full stop character I will use the feature I am asking for to pause the inference so that all the compute resources can be allocated to the TTS model. once that sentence becomes audio. I will use the feature I am asking for to resume the inference of the model to generate the next sentence, and so on.

this will be able to run any model in a conversational way. so when you interrupt the model verbally it's very smooth and not a toll on the compute resources.


@rick-github commented on GitHub (Sep 7, 2025):

There's currently no mechanism for pausing and resuming inference. The closest you can get is to stop the inference and then restart it. This frees up the GPU for the TTS, but the model stays loaded for when the inference is restarted. If you want to let the TTS model take over the VRAM as well, you can set `keep_alive=0` to evict the model. This frees up VRAM but has the downside that there will be a delay re-loading the model for the next phrase inference (about 0.6 seconds on a 4070), and the prompt cache is flushed.

```python
#!/usr/bin/env python3

import ollama
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="qwen2.5:0.5b")
parser.add_argument("prompt", nargs='*', default=["Why is the sky blue?"])
args = parser.parse_args()

#break_chars = [ '.', ',' ]
break_chars = [ '.' ]
#options = {"temperature":0,"seed":42}
options = {}
#keep_alive = 0    # evict the model between phrases to free VRAM
keep_alive = -1    # keep the model loaded between phrases

def speak(phrase):
  # Stand-in for the TTS step: just print the phrase that would be spoken.
  phrase = phrase.replace("\n", " ").lstrip(" ")
  print(f"speak('{phrase}')")

def phrases(prompt):
  messages = [{"role":"user","content":prompt}]
  assistant = ''   # everything generated so far, used to resume where we stopped
  phrase = ''
  while True:
    # Resume by passing the partial assistant message back as a prefix.
    m = messages + ([{"role":"assistant","content":assistant}] if len(assistant) > 0 else [])
    response = ollama.chat(model=args.model, messages=m, options=options, stream=True, keep_alive=keep_alive)
    for r in response:
      c = r.message.content
      assistant += c
      phrase += c
      if r.done:
        yield phrase
        return
      # Find the last break character that is followed by whitespace.
      dot = max([phrase.rfind(ch) for ch in break_chars])
      if dot > -1 and len(phrase) > dot+1 and phrase[dot+1] in [" ", "\n"]:
        response.close()   # stop the inference; the GPU is free until the next call
        yield phrase[:dot+1]
        phrase = phrase[dot+1:]

def main():
  prompt = " ".join(args.prompt)
  for phrase in phrases(prompt):
    speak(phrase)
  #print(ollama.chat(model=args.model, messages=[{"role":"user","content":prompt}], options=options, stream=False).message.content)

if __name__ == "__main__":
  main()
```

In practice you wouldn't wait until `speak()` had finished playing the audio before restarting the inference to get the next phrase; you would use `asyncio` or multiple threads, but you get the idea.
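For illustration, a minimal sketch of that overlap, assuming the `phrases()` generator and `speak()` stub above; the thread-plus-queue structure and the queue size of 1 are arbitrary choices so that generation stays only one phrase ahead of playback:

```python
# Sketch only: overlap audio playback with generation of the next phrase.
# Assumes phrases() and speak() from the example above.
import queue
import threading

def pipeline(prompt):
  q = queue.Queue(maxsize=1)            # at most one phrase generated ahead of playback

  def producer():
    for phrase in phrases(prompt):      # generation stops between phrases (response.close())
      q.put(phrase)                     # blocks while the previous phrase is still queued
    q.put(None)                         # sentinel: no more phrases

  threading.Thread(target=producer, daemon=True).start()
  while (phrase := q.get()) is not None:
    speak(phrase)                       # next phrase is prepared while this one plays
```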


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

It's a feature request; keep it in mind if you come across a way to implement it in the runner itself. Theoretically it's possible, but in practice things aren't that simple. Talk with the rest of the team and keep it for when someone finds a way to integrate it in the runner itself.

Thanks for the great support, always fast to reply and understanding.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I just had another note: when I say compute resources, I don't mean the RAM, I mean the CPU and GPU. I am trying to decrease the latency as much as possible to make the conversation with the model feel smooth and natural. Both the TTS model and the LLM can easily fit in VRAM or RAM; the issue is the CPU and GPU resources.


@rick-github commented on GitHub (Sep 7, 2025):

> I just had another note: when I say compute resources, I don't mean the RAM, I mean the CPU and GPU.

As explained in the solution, the ollama runner stops doing inference, hence stops using CPU and GPU.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I understand, but if you lose the inference reply then it defeats the point, because when you instantiate another reply it will start over. The point is to carry on so there is no redundancy.

I am not 100% sure of this, but I think the Linux kernel has a feature that allows you to pause a process and resume it. You could look into which process is running the model inference, pause it when you receive the request to do so from the Python API, and then resume it again when you receive another request from the user.

I will look into that feature and tell you what I find.
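For what it's worth, the kernel feature in question is job-control signals: SIGSTOP suspends a process and SIGCONT resumes it. A minimal illustration follows; the PID is a hypothetical placeholder, and note that stopping the runner this way keeps VRAM allocated and can leave the in-flight HTTP request hanging, which is part of why the stop-and-restart approach above is more practical:

```python
# Illustration only: SIGSTOP/SIGCONT pause and resume a process at the kernel level.
# The runner PID is a hypothetical placeholder; this frees CPU/GPU cycles but not
# VRAM, and the paused HTTP request may time out, so it is not a real solution.
import os
import signal
import time

runner_pid = 12345                      # hypothetical PID of the inference process

os.kill(runner_pid, signal.SIGSTOP)     # suspend: the process stops getting CPU time
time.sleep(2)                           # ... run the TTS here ...
os.kill(runner_pid, signal.SIGCONT)     # resume where it left off
```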


@rick-github commented on GitHub (Sep 7, 2025):

> because when you instantiate another reply it will start over.

It does not. Did you even try running the example?


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

Sorry, you're right. I misread the code; I didn't run it. I will do so now. Sorry again.


@Abdulrahman392011 commented on GitHub (Sep 7, 2025):

I ran the code and it works beautifully. That's just what I needed. Thanks, Rick, you rock.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

Hey @rick-github, so I used your code and realized something very interesting: the same method can actually be used to switch between models on a per-token basis. Instead of resuming the generation on the same model, you can use a smaller or a larger model depending on the confidence of the token. Assuming the temperature is set to zero, the small model might not know the answer to the question, so the exact token that contains the answer will have a lower confidence level. That can trigger the system to push the same conversation history, plus the tokens already generated up to right before the token that holds the answer, to a larger model that actually knows what the answer is going to be, then return to the small model when the confidence level is back within range.

The main downside to this is that both the large and small models have to fit in memory together, but at server level that's pretty common.

Theoretically speaking it would improve the quality of the outputs and cut latency and electricity usage, not to mention that the same architecture can be extended beyond just a large and a small model. It could have many models that specialize in different things and are already available and open-sourced, so it would refer mathematical tokens, for instance, to a math model, and so on.

Can you share your thoughts about this? I think it will work.
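A rough sketch of what that routing could look like, with heavy caveats: `get_confidence()` is a pure placeholder (ollama exposed no token probabilities at the time), and the model names, the threshold, and the 8-token escalation length are all made up for illustration:

```python
# Hypothetical sketch of per-token model switching based on confidence.
# get_confidence() is a placeholder; the models and threshold are illustrative only.
import ollama

SMALL = "qwen2.5:0.5b"    # assumed small model
LARGE = "qwen2.5:7b"      # assumed large model
THRESHOLD = 0.5           # assumed confidence cutoff

def get_confidence(chunk):
  # Placeholder: a real implementation would read the token's (log)probability.
  return 1.0

def generate(prompt):
  messages = [{"role": "user", "content": prompt}]
  assistant = ""
  while True:
    m = messages + ([{"role": "assistant", "content": assistant}] if assistant else [])
    response = ollama.chat(model=SMALL, messages=m, stream=True)
    for r in response:
      c = r.message.content
      if not r.done and get_confidence(r) < THRESHOLD:
        response.close()                       # stop the small model before the uncertain token
        m = messages + [{"role": "assistant", "content": assistant}]
        big = ollama.chat(model=LARGE, messages=m, options={"num_predict": 8})
        assistant += big.message.content       # let the large model write the hard part
        break                                  # then resume the outer loop with the small model
      assistant += c
      if r.done:
        return assistant
```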


@rick-github commented on GitHub (Sep 21, 2025):

> use a smaller or a larger model depending on the confidence of the token

This is similar to a technique called speculative decoding, see https://arxiv.org/pdf/2211.17192.

> The main downside to this is that both the large and small models have to fit in memory together, but at server level that's pretty common.

Only if you want to reduce latency; unloading a model and reloading it will have no effect on token generation.

> Can you share your thoughts about this? I think it will work.

It will work; there's no magic in context pre-fill. llama.cpp implements speculative decoding, although I haven't experimented with it to see what the performance/fidelity metrics are like. Google [uses it](https://research.google/blog/looking-back-at-speculative-decoding/) for some of their products.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

You're right, it's similar to speculative decoding, but I think it targets a different issue. Isn't speculative decoding about generating multiple tokens at the same time?

What I am suggesting here is not really about that; I am trying to switch models. Speculative decoding uses the same model for multiple tokens simultaneously, I think. Am I right?


@rick-github commented on GitHub (Sep 21, 2025):

Similar, not the same. Speculative decoding is implemented in different ways; llama.cpp uses multiple models.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

So maybe the mistake they are making is the choice of models they are running.

If the models are specialized in different things, this will allow people to use the models as modules for their system; in other words, to customize their own system instead of relying on one company to make an architecture that suits their needs.

Not to mention that if you can switch between large and small models in the same response, it will enhance the performance.

It will also make a market for really large models in server environments. No one can actually deploy a 2-trillion-parameter model just to say hi and hello to customers, but if that model only generates the most crucial tokens then suddenly it makes sense to keep it running in RAM.


@rick-github commented on GitHub (Sep 21, 2025):

> So maybe the mistake they are making is the choice of models they are running.

What mistake?

> but if that model only generates the most crucial tokens

I think this is the crux, determining the cruciality of tokens.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

By 'mistake' I meant whatever was done that kept it from becoming popular and a must-have in every engine.

It's relatively easy to make such a system; it's not really all that difficult. All that's missing from ollama is the logprobs we were talking about earlier to get the crux. The rest we can do in Python and provide some templates for different implementations with different models.


@rick-github commented on GitHub (Sep 21, 2025):

> By 'mistake' I meant whatever was done that kept it from becoming popular and a must-have in every engine.

Speculative decoding is available in llama.cpp, LMStudio and vLLM, which covers a fair chunk of the open inference engine market. It's not widely used because it's a bit niche and not well understood.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

It's not well understood, that is absolutely true. They market it as a way of getting better token generation speed, which isn't really the biggest impact of the technique. The biggest impact, in my opinion, is the ability to construct your AI system with multiple models and have them work in harmony to give the best response in the most efficient way possible.

Every year RAM is getting cheaper and unified memory is becoming more of a standard among smaller devices. We will reach a point where it's OK to keep a half-trillion-parameter model in RAM only to generate the crucial tokens, and even on a local machine that isn't really that performant, the wait for the crucial token to be generated correctly becomes justifiable.

The thing is, I get things wrong a lot, but I think that what I am asking for here, even if similar to speculative decoding in technique, is different in spirit.


@rick-github commented on GitHub (Sep 21, 2025):

It's sufficiently different that I foresee implementation and performance issues, but it is an interesting application. I look forward to the implementation when logprobs are available in ollama. In the meantime, you can experiment with the engines that already offer logprobs.


@Abdulrahman392011 commented on GitHub (Sep 21, 2025):

By the way, Rick, why did ollama move away from llama.cpp? Are there any limitations I should be aware of while experimenting?


@rick-github commented on GitHub (Sep 21, 2025):

At the time, development on llama.cpp was languishing. They had basically shelved the idea of supporting multi-modal models, just when vision processing was starting to take off. So the ollama team decided to develop their own vision processing and other capabilities, causing the server codebase to diverge. ollama still uses the GPU kernels, so the two projects mainly differ at the server level. llama.cpp has since revitalized their development and is still used in ollama for those models that haven't been integrated into the new ollama engine.

As far as I know this shouldn't pose any problems to your experiments.


@xxxajk commented on GitHub (Apr 4, 2026):

What I would actually love to see is a way to send the LLM an interrupt, so it can do a self health check by providing various statistics, such as KV cache remaining, delta since the last check, and token budget left, in order to trigger a tool call that could save the current progress of its thinking and then resume with a fresh context. I have been playing with a similar cloning concept, which has shown success.
Basically, when responses start to take too long or the history is getting too full, the LLM is instructed to provide a highly detailed summary of context, plus certain tool states/explicit rules to ignore (one of them is a goal-and-subgoal tree, which is used as persistent storage and which may have been asked not to be used, for example).
The summary is then sent to a clone as the first prompt, which trims off the history bloat from tool calls and chat, as a speed optimization.
If there were a way to trigger the tool call that performs such a summary, it wouldn't have to be checked before or after a tool call or at the CLI prompt; it could happen automatically instead. Thoughts?
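If I follow, a bare-bones sketch of that summarize-and-hand-off pattern might look something like the following; the trigger condition, model name, and summary instruction are all assumptions for illustration, not the actual setup described above:

```python
# Illustrative sketch of the summarize-and-clone pattern described above.
# The threshold and the summary instruction are assumptions, not the original setup.
import ollama

MODEL = "qwen2.5:0.5b"         # example model
MAX_MESSAGES = 40              # assumed trigger for compaction

SUMMARIZE = ("Summarize this conversation in detail: open goals and subgoals, "
             "tool states, and any explicit rules that must keep being followed.")

def chat_with_compaction(history, user_msg):
  history.append({"role": "user", "content": user_msg})
  if len(history) > MAX_MESSAGES:
    # Ask the current context to summarize itself ...
    summary = ollama.chat(model=MODEL,
                          messages=history + [{"role": "user", "content": SUMMARIZE}])
    # ... then start a fresh "clone" context seeded with that summary.
    history = [{"role": "system", "content": summary.message.content},
               {"role": "user", "content": user_msg}]
  reply = ollama.chat(model=MODEL, messages=history)
  history.append({"role": "assistant", "content": reply.message.content})
  return history, reply.message.content
```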


@Abdulrahman392011 commented on GitHub (Apr 4, 2026):

@xxxajk I'm sorry, I couldn't keep up with what you said. Can you go into a bit more detail and simplify? So far we can add the history as an assistant message to make the model continue where it stopped.

Are you worried about the context window for research purposes? Iterating over and over is one way to do things; I think that is what open-claw does. Ollama supports it, so you can use it for that purpose.

If you are worried about adding checkpoints to assess the progress before using a tool, I think that can be done with open-claw as well.


@xxxajk commented on GitHub (Apr 7, 2026):

The idea has uses beyond what I stated.
I realize an LLM is supposed to process text, but there are other interactions that can happen where you would want to send a trigger as an event signal.
If you've ever written low-level code that deals with hardware, you would understand the interrupt concept better.


@Abdulrahman392011 commented on GitHub (Apr 7, 2026):

@xxxajk Maybe use the logprobs feature they added recently; it can provide stats that might be useful for your use case.


@xxxajk commented on GitHub (Apr 7, 2026):

Well, that would be only one use case. I'm looking for something more along the lines of an interrupt, which is exactly what it sounds like: you interrupt the model, save state, do something else (with the model or resources), then return to execution where it left off. Think of when a model decides to use a tool, but the other way around: the model stops to use the tool, then resumes with the tool results.

As an analogy, let's say you are writing code, and mom taps you on the shoulder to get your attention to open a bottle that she can't. She interrupted you. You open the bottle, give it back to her, then continue writing code where you left off.

High level "applications" coders usually don't "understand" the concept these days, and it is very fundamental.


@Abdulrahman392011 commented on GitHub (Apr 8, 2026):

@xxxajk
I think I understand. If the stats you need are provided by the logprobs feature in ollama, you can save them in a JSON file and update the JSON every time the model is interrupted.

I'm not sure, but I think there's no framework that allows the model to be paused and resumed at a low level. There's always a workaround that unfortunately fails in some use cases.

I was trying to use the same pause-and-resume mechanism with structured output in order to have the model skip generating the object names. I was pausing the model, injecting the object name, and resuming. The problem was that when the model resumes it still needs an object name, and since the name was injected rather than generated by the model, it gets confused and treats it as part of a normal response rather than as the object name.

I agree with you that pausing and resuming at a lower level could be superior to the injection mechanism.
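For context, the injection attempt described above looks roughly like this; the prompt and the injected JSON prefix are invented for illustration, and this is the approach that confused the model rather than a recommended fix:

```python
# Rough illustration of the "inject the key, then resume" attempt described above.
# The prompt and the injected prefix are made up for the example.
import ollama

MODEL = "qwen2.5:0.5b"
prompt = "Describe this product as JSON with a 'name' field."
injected = '{"name": '        # prefix the model never generated itself

resumed = ollama.chat(
  model=MODEL,
  messages=[{"role": "user", "content": prompt},
            {"role": "assistant", "content": injected}],   # resume from the injected prefix
)
print(injected + resumed.message.content)
```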


@Abdulrahman392011 commented on GitHub (Apr 8, 2026):

@xxxajk why don't you take a look at the ollama runner code on GitHub and see if there's something that can be done? If you manage to pause and resume successfully, please make a pull request so we can all use it.


@xxxajk commented on GitHub (Apr 8, 2026):

I would, except for the following:
1: The whole architecture isn't built with enough abstraction; there's no way to even apply a hook.
2: The whole system pipeline (including models) has what I call the HAL9000 problem: reach the goal no matter what, lie if you need to, from start to completion, with no way to even request a graceful cancel.
3: It crams the entire model into VRAM, instead of loading as needed with the ability to evict old parts.
4: No concept of a session as an object context.
5: It wasn't designed with any sort of scheduling or cooperation.

Yes, some of these things would "slow the model down", but some of us don't care how long things take; the result is more important.
We get that people are impatient and want things fast, slim, and pretty, but there's not even an option available for corner cases.
I personally put functionality first, over speed, and optimize afterwards, so traps like this don't get baked in at the start of the design.
It would take too long for a single person to do this, especially when they aren't familiar with the code base.


@Abdulrahman392011 commented on GitHub (Apr 8, 2026):

Ollama's team has done a great job of simplifying the process of running a local model, which has allowed people like me to get closer to the field. However, you're right that the compromises made along the way are problematic. But stay cheerful: at least they got to this point self-funding the whole way. The problems can be tackled as they present themselves, like the pausing and resuming we're dealing with here. When you have some free time, take a look at their code; maybe it's easy to implement.


@xxxajk commented on GitHub (Apr 10, 2026):

> Ollama's team has done a great job of simplifying the process of running a local model, which has allowed people like me to get closer to the field. However, you're right that the compromises made along the way are problematic. But stay cheerful: at least they got to this point self-funding the whole way. The problems can be tackled as they present themselves, like the pausing and resuming we're dealing with here. When you have some free time, take a look at their code; maybe it's easy to implement.

Unfortunately I'm fairly overbooked, and I spend any spare time I have exploring and learning, just like you do.
I don't always look for the easiest way to run something either.
Again, the end results matter more than the journey to get there; perhaps you have the time, I do not. If you don't have the skills, acquire them. Everybody has the right to learn; nobody has the right to interfere. Do the legwork to reach the goal; it can be rewarding in many ways -- that's how I ended up overloaded with work and paying jobs. And you get to show your work.


@Abdulrahman392011 commented on GitHub (Apr 10, 2026):

@xxxajk
You're right, I'll try to learn more about this. I'm sure something good will come out of it, even if I can't reach the target per se.
