[GH-ISSUE #11257] Model Inference Hanging #53931

Closed
opened 2026-04-29 04:58:06 -05:00 by GiteaMirror · 14 comments
Owner

Originally created by @SIMSB-99 on GitHub (Jul 1, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11257

What is the issue?

I have a Python script that takes images from a folder on my PC and sends them to the model (VLM; llama3.2-vision:latest) one by one with the prompt. The script hangs after a few minutes, until Ollama unloads the model (ollama ps shows the model stopping) or I manually stop the model with the ollama stop llama3.2-vision:latest command. Once the model stops, the script resumes until it gets stuck again.

I previously ran the same script on the ZimaBlueAI/Qwen2-VL-7B-Instruct model without running into any such issues. I am stumped on what could be causing this issue or how it can potentially be fixed.

Relevant log output

(When the model is running)
C:\Windows\System32>ollama ps                      
NAME                      ID              SIZE     PROCESSOR    UNTIL
llama3.2-vision:latest    6f2f9757ae97    12 GB    100% GPU     4 minutes from now

(When the model gets stuck; after the 4 minute timeout)
C:\Windows\System32>ollama ps
NAME                      ID              SIZE     PROCESSOR    UNTIL
llama3.2-vision:latest    6f2f9757ae97    12 GB    100% GPU     Stopping...

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.9.3

GiteaMirror added the bug label 2026-04-29 04:58:06 -05:00

@rick-github commented on GitHub (Jul 1, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@SIMSB-99 commented on GitHub (Jul 1, 2025):

server.log (https://github.com/user-attachments/files/21006545/server.log)

Here you go. The script hangs, and after waiting for 4 minutes I run the ollama ps command, which shows UNTIL: Stopping... (as shown above) before the script resumes, only to get stuck again after a few more API calls.


@rick-github commented on GitHub (Jul 1, 2025):

Could you add OLLAMA_DEBUG=2 to the server environment and post the resulting logs?
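For reference, a minimal way to do this when running the server from a terminal (a sketch, assuming PowerShell on Windows; any tray instance of Ollama must be quit first so the new environment takes effect):

```shell
# PowerShell: quit the tray instance of Ollama first, then run the
# server in the foreground with verbose debug logging enabled.
$env:OLLAMA_DEBUG=2
ollama serve
```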


@SIMSB-99 commented on GitHub (Jul 1, 2025):

I have uploaded the log file to my Google Drive due to file size restrictions on upload here. Here is the link: https://drive.google.com/file/d/1v_Mzx9qpBFc_I0pGM7VzdH-w-kv4FAF-/view?usp=sharing


@SIMSB-99 commented on GitHub (Jul 7, 2025):

Update: I tried running another model, gemma3:7b, which is smaller in size (6.0 GB) than ZimaBlueAI/Qwen2-VL-7B-Instruct (6.7 GB). However, I am still facing the issue of the script hanging after every couple of images. The script runs perfectly fine with the original ZimaBlueAI/Qwen2-VL-7B-Instruct model.

Here is how I am calling the model in the script:

vlm_response = chat(
    messages=[
        {
            "role": "user",
            "content": prompt,
            "images": [image]
        }
    ],
    model=model_name,
    format=SCHEMA
)


@rick-github commented on GitHub (Jul 8, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues), attached to this issue, may aid in debugging.


@SIMSB-99 commented on GitHub (Jul 9, 2025):

> I have uploaded the log file to my Google Drive due to file size restrictions on upload here. Here is the link: https://drive.google.com/file/d/1v_Mzx9qpBFc_I0pGM7VzdH-w-kv4FAF-/view?usp=sharing

I have uploaded them here, as previously requested!


@rick-github commented on GitHub (Jul 9, 2025):

Requires a google sign-in.


@SIMSB-99 commented on GitHub (Jul 11, 2025):

I have updated the permission now.


@SIMSB-99 commented on GitHub (Jul 22, 2025):

Any update on this issue? I have tried running my script on my university's GPU cluster, but I am still facing the same issue. I also came across this post (https://www.reddit.com/r/LocalLLaMA/comments/1et810q/comment/n2uv8n1/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) describing a similar issue that was fixed by an Ollama version update in the past.


@skwde commented on GitHub (Jul 25, 2025):

Same issue here with 0.9.6 on Linux with an Intel CPU, and I suspect the same happens with an AMD CPU as well (because the GPU is not used; see https://github.com/ollama/ollama/issues/11519).


@rick-github commented on GitHub (Jul 25, 2025):

Logs show that the model got stuck generating the JSON for the last image in the log:

{
  "out_archetype": "2",
  "privacy_label": "not private",
  "out_category": "airport",
  "Explanation": "The scene is an airport, which is a public and public-often-expected-to-be-visited-places-where-people-never-expected-privacy-protects-which-never-expected-privacy-protects-which-never-expected-places-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which-never-expected-which- never-…-which-never-expected-which-never-…-which-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-…-

This happens occasionally with models - they lose coherence and start rambling. Being restricted to a JSON structure may exacerbate the problem.

It's possible that adding more instruction to the Explanation entry in the system message may help, or perhaps an example of the expected output.

A workaround is to set num_predict to make the model terminate when it starts to go off the rails. This will cause the inference to terminate with a done_reason of length, allowing the client to retry if it detects a boundary failure.
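The retry loop described here can be sketched roughly as follows (illustrative names: generate_with_retry and call_model are not part of the Ollama API; call_model is injected so the logic is testable without a server, and in practice would wrap the chat call with options={"num_predict": ...} set):

```python
# Hedged sketch of the num_predict workaround: cap generation length and
# retry when the model stops because it hit the cap, which signals a
# possible runaway generation (done_reason == "length").

def generate_with_retry(call_model, max_retries=3):
    """Return the first response that did not hit the token cap,
    or the last attempt's response after max_retries tries."""
    response = None
    for _ in range(max_retries):
        response = call_model()
        if response.get("done_reason") != "length":
            break  # model stopped on its own; accept this response
    return response
```

With the real client, call_model could be a closure around the chat call shown earlier in this thread, passing something like options={"num_predict": 512}; a response cut off at the cap is the signal to retry or skip that image.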


@karim20010 commented on GitHub (Jul 31, 2025):

Hello, I am running the ollama qwen2.5vl:7b model on Google Colab. I have created a script that takes a text file and an image and corrects the text file according to the image. It works well for the first few pages, but then it gets stuck on one page and does nothing. I tried putting a timeout on the stuck page, after which it continues normally, but I need that page to be processed. Any help with this issue?
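A per-page timeout like the one mentioned can be sketched as below (an illustrative helper, not an Ollama API; note the hung call is only abandoned, not cancelled, so the server keeps processing it in the background):

```python
# Hedged sketch: run the model call in a worker thread and give up if
# it does not finish in time. The worker thread is left running if it
# hangs; shutdown(wait=False) just lets the caller move on.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout(fn, timeout_s):
    """Run fn(); return its result, or None if it exceeds timeout_s."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None  # caller can log the stuck page and retry it later
    finally:
        executor.shutdown(wait=False)
```

Pages that return None can be collected and retried in a second pass, which matches the behavior reported in this thread where a stuck request often succeeds after the model is restarted.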


@rick-github commented on GitHub (Jul 31, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.

Reference: github-starred/ollama#53931