[GH-ISSUE #8513] Support for Multiple Images in /chat Endpoint #31247

Closed
opened 2026-04-22 11:31:02 -05:00 by GiteaMirror · 18 comments

Originally created by @pmedina-42 on GitHub (Jan 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8513

Currently, the /chat endpoint includes the images field, but it only supports a single image. While this is functional, it adds a layer of complexity when performing RAG with base64-encoded images.

For instance, if the content retriever returns multiple top-scoring embeddings that reference different images, we have to manually reconstruct the full images (potentially missing lower-scoring records when a single record isn't large enough to hold the whole image) and then make a separate /chat call for each retrieved image. Finally, all the responses must be summarized into one.

This manual process could be significantly simplified if the images field allowed for passing multiple images in a single request.
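
For illustration, the desired request shape, with multiple images in a single message, might look like the following Python sketch (model and file names are placeholders; this assumes a local Ollama server on the default port):

```python
# Hypothetical multi-image /api/chat request (a sketch of the requested
# behavior): several base64 images attached to one message.
import base64
import requests

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llava",  # placeholder vision model
        "messages": [{
            "role": "user",
            "content": "describe the animals shown in the images",
            "images": [b64("image1.jpg"), b64("image2.jpg")],  # multiple images
        }],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```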

Is there any plan in the near future to support multiple images in the images field? This enhancement would greatly streamline workflows and reduce the overhead in scenarios like the one described above.

Additionally, I’m just getting started and don’t have much experience yet, so it’s possible that I’m overlooking something that could make this process easier. If there’s a better approach or workaround I might have missed, I’d be grateful for any guidance.

Thank you in advance!

GiteaMirror added the feature request label 2026-04-22 11:31:02 -05:00

@rick-github commented on GitHub (Jan 21, 2025):

ollama supports multiple images, but most models do not. Note how llava merges the two images and describes a kitten with the collar from the puppy image.

```console
$ for i in minicpm-v:8b-2.6-q4_K_M moondream:1.8b-v2-fp16 llava ; do 
  echo $i ; 
  echo '{"model": "'$i'",
         "messages":[{
            "role":"user","content":"describe the animals shown in the images",
            "images": [
              "'"$(base64 puppy.jpg)"'",
              "'"$(base64 kitten.jpg)"'"
            ]
          }],
         "stream":false}' | curl -s http://localhost:11434/api/chat -d @- | jq -r .message.content ;
done

#minicpm-v:8b-2.6-q4_K_M
The first image shows a small white puppy sitting on what appears to be concrete steps. The puppy
has bright eyes and is wearing a red collar with a bell attached.

The second image depicts an orange kitten lying down, looking directly at the camera. It has large,
expressive greenish-yellow eyes and pointed ears typical of many cat breeds.

#moondream:1.8b-v2-fp16
In the image, there is a cute orange kitten sitting on top of an object that appears to be either a couch
or a bed. The kitten seems to be looking directly at the camera and has its eyes open wide, capturing
attention from viewers. The scene exudes warmth and playfulness as this adorable feline takes center
stage in the composition.

#llava
The image shows a small, kitten-like cat with a white coat and tan-colored ears. It has striking blue
eyes and is wearing a red collar with a tag. The cat appears to be sitting or lying down on what looks
like a wooden floor or deck.
```

@joshuabolick commented on GitHub (Dec 8, 2025):

Greetings all, we have been struggling with this too. We are using the Gemma3:12b model, sending two images in the request to the generate endpoint via the Python API, and it looks very much like the two images are being merged when we try to compare them.

Is it definitely true that this cannot be done with current Ollama using the Gemma3:12b model?

We have also been doing this by calling out directly to the Google API with the same model and the same request, and it works without a problem, so this has been very puzzling.

Please let me know and thank you!


@rick-github commented on GitHub (Dec 8, 2025):

```console
$ echo '{
  "model": "gemma3:12b",
  "messages":[{
    "role":"user","content":"describe the animals shown in the images",
    "images": [
      "'"$(base64 image1.jpg)"'",
      "'"$(base64 image2.jpg)"'"
    ]
  }],
  "stream":false
}' | curl -s http://localhost:11434/api/chat -d @- | jq -r .message.content
Here's a description of the animals in the images:

**Image 1: White Puppy**

*   **Type:** It appears to be a very young puppy, likely a mix. Given its fluffy white coat, it could have some Spitz breed in its heritage (like an American Eskimo Dog or Samoyed, but it's difficult to be certain).
*   **Appearance:** The puppy is small and fluffy, with a pristine white coat. It has a small, slightly upturned nose and dark eyes. It's wearing a red collar with a gold bell.
*   **Pose:** The puppy is sitting on a set of concrete steps, looking off to the side with a slightly melancholy expression.

**Image 2: Ginger Kitten**

*   **Type:** It is a kitten.
*   **Appearance:** The kitten has a vibrant ginger (orange/red) coat, with white markings around its face. It has large, expressive eyes, and its ears are large in proportion to its head.
*   **Pose:** The kitten is perched on a textured surface and looking directly into the camera.



Let me know if you're curious about anything else regarding these images!
```

@joshuabolick commented on GitHub (Dec 8, 2025):

Okay, thanks. So it sounds like it should work with this model then?

We are using the generate endpoint; should that work the same way?


@rick-github commented on GitHub (Dec 8, 2025):

```console
$ echo '{
  "model": "gemma3:12b",
  "prompt":"describe the animals shown in the images",
  "images": [
    "'"$(base64 image1.jpg)"'",
    "'"$(base64 image2.jpg)"'"
  ],
  "stream":false
}' | curl -s http://localhost:11434/api/generate -d @- | jq -r .response
Here's a description of the animals in the images:

**Image 1: The Puppy**

*   **Type:** The puppy appears to be a Samoyed or a breed with similar characteristics.
*   **Appearance:** It's a fluffy, snow-white puppy. It has a cute, slightly worried expression and small, dark eyes. It's wearing a red collar with a bell.
*   **Pose:** It's sitting on what looks like a stone step.

**Image 2: The Kitten**

*   **Type:** It's an orange tabby kitten.
*   **Appearance:** The kitten is a beautiful orange tabby with a fluffy appearance. It has a sweet expression and large, curious eyes.
*   **Pose:** The kitten is sitting or crouching on a textured surface that appears to be part of a chair or cushion.



Let me know if you want me to describe anything else!
```

@joshuabolick commented on GitHub (Dec 9, 2025):

Thank you so much for your help!

So for our usage, we are using the Python ollama library (just installed the latest, v0.6.1) and also updated to the very latest version of Ollama (0.13.2), running on our Ubuntu server, which has 2 NVIDIA GPUs that appear to be balancing the request load correctly.

Anyway, I am manually testing this on our server with just 2 images, and you can see our prompt below (I know it's long; we are still adjusting it). It feels like the base64-encoded images or the requests may be getting cached somehow. I have been searching and found various ways to try to disable caching, which you can see in the request options, but so far I am still getting the same summary for both Frame A and Frame B (I have attached these images; Frame A is a dog and Frame B is a bird). I have also confirmed at least a couple of times that the images we are sending are correct, and even decoded and saved them again afterwards to verify; from everything I can tell, we are sending the right images.

![Image](https://github.com/user-attachments/assets/ffe6e2f9-6917-4be1-a0d7-78ef24d6bfb0)
![Image](https://github.com/user-attachments/assets/6ea58831-3d67-4726-ad19-0bbec4e6f456)

Also, for our use case we are analyzing a video at certain frame intervals, iterating through the frames to compare them, so we are making these requests anywhere from 50 to 200+ times per run.

When testing locally on my MacBook (it runs slowly but still works), if I tried enough times it would occasionally give me the Frame B summary of the bird, but most of the time both the Frame A and Frame B summaries described the dog, as shown in the logs below.

Please let me know; I am not sure if I am doing something wrong, if I have found some other issue, or whether there is a way to disable the cache so this always produces the correct summaries for both images. Thank you so much for your help!

Below is the Python code showing how we are sending the request, along with the logged response:

Python code:
<img width="895" height="888" alt="Image" src="https://github.com/user-attachments/assets/a42e9866-4d45-46c9-b5ab-0d565e45daf9" />

Logged response:

2025-12-09 11:42:19,989 - root - INFO - frame_pair: ['frame_000001.jpg', 'frame_000002.jpg']
2025-12-09 11:42:19,989 - root - INFO - model: gemma3:12b
2025-12-09 11:42:19,989 - root - INFO - frame1_path: ../scene_analysis_output/frames/frame_000001.jpg
2025-12-09 11:42:19,989 - root - INFO - frame2_path: ../scene_analysis_output/frames/frame_000002.jpg
2025-12-09 11:42:19,990 - root - INFO - prompt_text=

SYSTEM INSTRUCTION: SCENE GROUPING & OCR VALIDATOR

ROLE: You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A and Frame B) and decide if they belong to the same scene group.

CRITICAL DIRECTIVE: You must act as a strict OCR (Optical Character Recognition) Validator. The text overlay (specifically the DATE) is the absolute source of truth. Visual similarity of the room or characters is IRRELEVANT if the date text changes.

OUTPUT FORMAT: You MUST output ONLY a single, valid JSON object. No markdown formatting outside the code block.

TASK: FRAME COMPARISON LOGIC

STEP 1: TEXT EXTRACTION (CRITICAL)
First, transcribe any date/time overlay text found on Frame A and Frame B into the JSON fields provided.

STEP 2: VISUAL SUMMARIZATION
Briefly summarize the visual content (objects, setting, action) in the summary fields.

STEP 3: LOGIC GATES (Execute in Order)

GATE 1: THE "DIFFERENT DATE" TRAP (Highest Priority - BREAK)

  • Compare the extracted "frame_a_date_text" and "frame_b_date_text".
  • If the text exists on both but the DAY number is different (e.g., "Dec 01" vs "Dec 02"), you MUST output "same_scene: false".
  • Reasoning: "Date Mismatch." (STOP AND RETURN)

GATE 2: THE "MISSING OVERLAY" TRAP (High Priority - BREAK)

  • If one frame has a date/text overlay and the other does not, you MUST output "same_scene: false".
  • Reasoning: "Overlay Inconsistency." (STOP AND RETURN)

GATE 3: THE "SAME DATE" OVERRIDE (High Priority - KEEP)

  • If both frames have a date overlay and the date strings are IDENTICAL, you MUST output "same_scene: true".
  • NOTE: This overrides visual changes. If the date is exactly the same, it is the same scene. (STOP AND RETURN)

GATE 4: VISUAL ANALYSIS (Only if No Dates are Present)

  • If (and ONLY if) there are NO text overlays on either frame:
    • BREAK if the location/setting changes to a completely new different location/setting.
    • BREAK if one frame is solid color/static/artifact.
    • KEEP if the setting is even remotely the same, or differences like movement, blur, camera flashes, or if overall unsure based on analysis.

REQUIRED JSON OUTPUT SCHEMA

Respond ONLY with a single valid JSON object.

{
"frame_a_date_text": "[Exact transcription of date text on Frame A. Write 'NONE' if no text.]",
"frame_b_date_text": "[Exact transcription of date text on Frame B. Write 'NONE' if no text.]",
"frame_a_summary": "[Summary of visual content in Frame A]",
"frame_b_summary": "[Summary of visual content in Frame B]",
"reasoning": "[Brief explanation. If dates differ, explicitly state: 'Date changed from X to Y'.]",
"same_scene": [boolean: true or false]
}

2025-12-09 11:42:19,995 - httpcore.connection - DEBUG - connect_tcp.started host='127.0.0.1' port=11434 local_address=None timeout=None socket_options=None
2025-12-09 11:42:19,995 - httpcore.connection - DEBUG - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7effd3170730>
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - send_request_headers.started request=<Request [b'POST']>
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - send_request_headers.complete
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - send_request_body.started request=<Request [b'POST']>
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - send_request_body.complete
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - receive_response_headers.started request=<Request [b'POST']>
2025-12-09 11:42:27,207 - httpcore.http11 - DEBUG - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Type', b'application/json; charset=utf-8'), (b'Date', b'Tue, 09 Dec 2025 18:42:27 GMT'), (b'Transfer-Encoding', b'chunked')])
2025-12-09 11:42:27,208 - httpx - INFO - HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
2025-12-09 11:42:27,208 - httpcore.http11 - DEBUG - receive_response_body.started request=<Request [b'POST']>
2025-12-09 11:42:27,208 - httpcore.http11 - DEBUG - receive_response_body.complete
2025-12-09 11:42:27,208 - httpcore.http11 - DEBUG - response_closed.started
2025-12-09 11:42:27,208 - httpcore.http11 - DEBUG - response_closed.complete
2025-12-09 11:42:27,208 - root - INFO - response=```json
{
"frame_a_date_text": "NONE",
"frame_b_date_text": "NONE",
"frame_a_summary": "Close-up of a small, tan and white dog with floppy ears and a blue and white patterned collar, standing on a wooden deck.",
"frame_b_summary": "Close-up of a small, tan and white dog with floppy ears and a blue and white patterned collar, standing on a wooden deck.",
"reasoning": "No date text found on either frame.",
"same_scene": true
}


@joshuabolick commented on GitHub (Dec 9, 2025):

Oh yes, one other thing to add: our system can also use an API key to call out directly to the Google API using the same Gemma3:12b model, and when we do it that way the image summaries always appear to be correct. The code leading up to that call is pretty much the same, except that for Google you upload the images rather than encoding them for the Ollama request...

Please let me know and thanks again so much for your help!


@rick-github commented on GitHub (Dec 9, 2025):

It's much easier to test things if you supply the raw text. For example, I assume that the text starting with "SYSTEM INSTRUCTION" is the value of `PROMPT_TEXT`, but because you've pasted the text in without a markdown block, it's being rendered by the browser and I don't know if detail is being lost. Also, screenshotting the python code means I can't cut and paste to test it.


@joshuabolick commented on GitHub (Dec 9, 2025):

Thanks, sorry about that; here it is:

```
frames_prompt_text: |
  -----
  ### SYSTEM INSTRUCTION: SCENE GROUPING & OCR VALIDATOR

  **ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A and Frame B) and decide if they belong to the same scene group.

  **CRITICAL DIRECTIVE:** You must act as a strict **OCR (Optical Character Recognition) Validator**. The text overlay (specifically the DATE) is the absolute source of truth. Visual similarity of the room or characters is IRRELEVANT if the date text changes.

  **OUTPUT FORMAT:** You MUST output ONLY a single, valid JSON object. No markdown formatting outside the code block.
  ---
  ### TASK: FRAME COMPARISON LOGIC

  **STEP 1: TEXT EXTRACTION (CRITICAL)**
  First, transcribe any date/time overlay text found on Frame A and Frame B into the JSON fields provided.

  **STEP 2: VISUAL SUMMARIZATION**
  Briefly summarize the visual content (objects, setting, action) in the summary fields.

  **STEP 3: LOGIC GATES (Execute in Order)**

  **GATE 1: THE "DIFFERENT DATE" TRAP (Highest Priority - BREAK)**
  * Compare the extracted "frame_a_date_text" and "frame_b_date_text".
  * If the text exists on both but the **DAY number** is different (e.g., "Dec 01" vs "Dec 02"), you **MUST** output "same_scene: false".
  * *Reasoning:* "Date Mismatch." (STOP AND RETURN)

  **GATE 2: THE "MISSING OVERLAY" TRAP (High Priority - BREAK)**
  * If one frame has a date/text overlay and the other does not, you **MUST** output "same_scene: false".
  * *Reasoning:* "Overlay Inconsistency." (STOP AND RETURN)

  **GATE 3: THE "SAME DATE" OVERRIDE (High Priority - KEEP)**
  * If both frames have a date overlay and the date strings are **IDENTICAL**, you **MUST** output "same_scene: true".
  * **NOTE:** This overrides visual changes. If the date is exactly the same, it is the same scene. (STOP AND RETURN)

  **GATE 4: VISUAL ANALYSIS (Only if No Dates are Present)**
  * If (and ONLY if) there are NO text overlays on either frame:
      * **BREAK** if the location/setting changes to a completely new different location/setting.
      * **BREAK** if one frame is solid color/static/artifact.
      * **KEEP** if the setting is even remotely the same, or differences like movement, blur, camera flashes, or if overall unsure based on analysis.

  ---
  ### REQUIRED JSON OUTPUT SCHEMA

  Respond **ONLY** with a single valid JSON object.

  {
  "frame_a_date_text": "[Exact transcription of date text on Frame A. Write 'NONE' if no text.]",
  "frame_b_date_text": "[Exact transcription of date text on Frame B. Write 'NONE' if no text.]",
  "frame_a_summary": "[Summary of visual content in Frame A]",
  "frame_b_summary": "[Summary of visual content in Frame B]",
  "reasoning": "[Brief explanation. If dates differ, explicitly state: 'Date changed from X to Y'.]",
  "same_scene": [boolean: true or false]
  }
```

Also here is that Python code:

```python
# Function to encode an image to base64
def encode_image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def call_ollama_api(frame_pair, output_folder):
    """
    Calls the local Ollama server to analyze a pair of frame images with the
    Gemma3 AI model

    Args:
        frame_pair: Pair of frame image filenames to compare
        output_folder: Output folder containing the frames/ subdirectory

    Returns:
        AI model frame image analysis response text
    """
    configuration_settings = get_config()

    # Initialize the Ollama client
    client = ollama.Client(configuration_settings.get(OLLAMA_URL))
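    # Note: this client instance is not used below; the ollama.generate() call
    # further down goes through the module-level default client instead.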

    logging.info(f"frame_pair: {frame_pair}")
    logging.info(f"model: {configuration_settings.get(OLLAMA_MODEL_NAME)}")
    frame1_path = output_folder + "/frames/" + frame_pair[0]
    frame2_path = output_folder + "/frames/" + frame_pair[1]

    logging.info(f"frame1_path: {frame1_path}")
    logging.info(f"frame2_path: {frame2_path}")

    # Encode images to base64
    FrameA = encode_image_to_base64(frame1_path)
    FrameB = encode_image_to_base64(frame2_path)

    prompt = configuration_settings.get(PROMPT_TEXT)
    # prompt = "describe the animals shown in the images"
    logging.info(f"prompt={prompt}")

    # Make the generate request with the images
    try:
        response = ollama.generate(
            model=configuration_settings.get(OLLAMA_MODEL_NAME),
            prompt=prompt,
            images=[FrameA, FrameB],
            options={
                'num_ctx': 0, # This guarantees the request is processed without memory of previous interactions
                'keep_alive': '0s', # Use a string literal that specifies time unit (e.g., '0s' for 0 seconds)
            }
        )
        logging.info(f"response={response['response']}")

        return response['response']
    except Exception as e:
        logging.error(f"An error occurred: {e}")
        return f"An error occurred: {e}"
```

Thank you again so much for your help!


@rick-github commented on GitHub (Dec 9, 2025):

Internally, the ollama server just tokenizes the images and prepends them to the start of the prompt. They don't have any identifiers, so the model can be confused about which image is being asked about when the prompt targets a specific image. Adding `img` tags can help the model disambiguate image references:

```diff
--- prompt_text.orig	2025-12-09 22:47:28.947539440 +0100
+++ prompt_text	2025-12-09 22:44:38.379639080 +0100
@@ -1,6 +1,6 @@
 ### SYSTEM INSTRUCTION: SCENE GROUPING & OCR VALIDATOR
 
-**ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A and Frame B) and decide if they belong to the same scene group.
+**ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A [img-0] and Frame B [img-1]) and decide if they belong to the same scene group.
 
 **CRITICAL DIRECTIVE:** You must act as a strict **OCR (Optical Character Recognition) Validator**. The text overlay (specifically the DATE) is the absolute source of truth. Visual similarity of the room or characters is IRRELEVANT if the date text changes.
```
 

> 'num_ctx': 0, # This guarantees the request is processed without memory of previous interactions

This assertion is incorrect. All this does is reduce the size of the context to the minimum supported by the model, 2048 tokens in the case of image models. Ollama does cache prompts but does not maintain a memory of previous interactions. The cache is invalidated at the point where the prompt deviates from the previous prompt. Unloading the model will also do this, but:

> 'keep_alive': '0s', # Use a string literal that specifies time unit (e.g., '0s' for 0 seconds)

`keep_alive` is not a generation option, it is a runner option:

```python
response = ollama.generate(
    model=configuration_settings.get(OLLAMA_MODEL_NAME),
    prompt=prompt,
    images=[FrameA, FrameB],
    keep_alive='0s',  # Use a string literal that specifies a time unit (e.g., '0s' for 0 seconds)
)
```
````console
$ ./8513.py 
```json
{
"frame_a_date_text": "NONE",
"frame_b_date_text": "NONE",
"frame_a_summary": "Frame A depicts a small, fluffy dog with brown and white fur, a pink nose, and a blue patterned collar. The dog is looking directly at the camera on a light-colored wooden deck.",
"frame_b_summary": "Frame B shows a colorful bird perched on a branch. The bird has blue and red plumage and a green back. The background is blurred.",
"reasoning": "No date text present on either frame. Location/setting is completely different (dog on deck vs bird on a branch).",
"same_scene": false
}
```
````
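
Putting the two corrections together, a minimal end-to-end sketch (file and model names are placeholders) might look like:

```python
# Sketch combining the fixes discussed above: [img-N] tags in the prompt to
# disambiguate image references, and keep_alive passed as a top-level argument
# rather than inside options.
import base64
import ollama

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "Compare the two video frames in this request "
    "(Frame A [img-0] and Frame B [img-1]) and describe each in one sentence."
)

response = ollama.generate(
    model="gemma3:12b",
    prompt=prompt,
    images=[b64("frame_a.jpg"), b64("frame_b.jpg")],
    keep_alive="0s",  # runner option: unload the model after this request
)
print(response["response"])
```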

@joshuabolick commented on GitHub (Dec 9, 2025):

Okay thank you very much!

So where you put the keep_alive there inside of the generate call is correct then?

Also do you think it was the image tags in the prompt or moving the keep_alive to the proper place or both that got us the accurate result?

Thank you again for your help I very much appreciate it!!


@joshuabolick commented on GitHub (Dec 9, 2025):

Also just to confirm, when you say "The cache will be invalidated at the point the prompt deviates from the previous prompt." I am assuming this means the prompt text alone and having new images in the request will not invalidate the cache right?

If we are using the same prompt for each of these image pair comparisons will this cause us issues or?


@rick-github commented on GitHub (Dec 9, 2025):

> So where you put the keep_alive there inside of the generate call is correct then?

Yes.

> Also do you think it was the image tags in the prompt or moving the keep_alive to the proper place or both that got us the accurate result?

Adding the image tags.

> I am assuming this means the prompt text alone and having new images in the request will not invalidate the cache right?

Images are prepended to the prompt, so a new image will invalidate the entire prompt cache.

> If we are using the same prompt for each of these image pair comparisons will this cause us issues or?

No issues.


@rick-github commented on GitHub (Dec 9, 2025):

Let me correct myself on the image/prompt interaction. If the client is always sending the same number of images with the same prompt, then the prompt cache will be used. The images themselves are a set of tokens that are processed in a separate batch and are not subject to prompt caching. So while the text prompt will be re-used, there should be no issues with image bleedover. If there is, that would be a bug, and you should open a new issue to have it dealt with.
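
If you want to check for bleedover empirically, one approach is to send the same pair of frames in both orders and see whether the per-frame descriptions swap accordingly (a sketch; file and model names are placeholders):

```python
# Bleedover check (sketch): if the model reads both images correctly, the
# descriptions should swap when the image order swaps.
import base64
import ollama

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

PROMPT = (
    "Frame A is [img-0] and Frame B is [img-1]. "
    "In one sentence each, describe Frame A and then Frame B."
)

a, b = b64("frame_a.jpg"), b64("frame_b.jpg")
for order, images in [("A,B", [a, b]), ("B,A", [b, a])]:
    result = ollama.generate(model="gemma3:12b", prompt=PROMPT, images=images)
    print(f"--- image order {order} ---\n{result['response']}\n")
```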


@joshuabolick commented on GitHub (Dec 9, 2025):

Thank you again @rick-github this has been extremely helpful!

If you have a buy me a coffee thing or something I will buy you at least a couple! haha thanks again cheers! :)


@joshuabolick commented on GitHub (Dec 10, 2025):

Hey @rick-github, I just wanted to follow up. Making that change to add the tags for the images definitely improved the results, but I am still seeing the problem sometimes, especially since we are iterating through frame images from a video.

Also, one thing I forgot to mention about our comparisons: we compare, for example, frame 100 to frame 200 as one pair, then on the next iteration we compare frame 200 to frame 300, and so on. So on the next iteration we send that frame 200 again, now as img-0 where it was img-1 in the previous iteration. I am not sure if that could somehow be causing this issue?

Attached here are a couple of image comparisons where you can see it seems to be using the same image for both, and also sometimes it will miss the date text overlay, which is definitely important for our process.

Anyway, I just wanted to mention that detail about how we are making these requests, and also to ask whether there is anything else I can try to prevent it from sometimes using the wrong images in the comparisons. On other iterations it works correctly, so it is not always using the wrong images, but from what I can tell it sometimes is.

Please let me know what you think and thanks again for all your help!

```
frames_prompt_text: |
  -----
  ### SYSTEM INSTRUCTION: SCENE GROUPING & OCR VALIDATOR

  **ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A [img-0] and Frame B [img-1]) and decide if they belong to the same scene group.

  **CRITICAL DIRECTIVE:** You must act as a strict **OCR (Optical Character Recognition) Validator**. The text overlay (specifically the DATE) is the absolute source of truth. Visual similarity of the room or characters is IRRELEVANT if the date text changes.

  **OUTPUT FORMAT:** You MUST output ONLY a single, valid JSON object. No markdown formatting outside the code block.
  ---
  ### TASK: FRAME COMPARISON LOGIC

  **STEP 1: TEXT EXTRACTION (CRITICAL)**
  First, transcribe any date/time overlay text found on Frame A and Frame B into the JSON fields provided.

  **STEP 2: VISUAL SUMMARIZATION**
  Briefly summarize the visual content (objects, setting, action) in the summary fields.

  **STEP 3: LOGIC GATES (Execute in Order)**

  **GATE 1: THE "DIFFERENT DATE" TRAP (Highest Priority - BREAK)**
  * Compare the extracted "frame_a_date_text" and "frame_b_date_text".
  * If the text exists on both but the **DAY number** is different (e.g., "Dec 01" vs "Dec 02"), you **MUST** output "same_scene: false".
  * *Reasoning:* "Date Mismatch." (STOP AND RETURN)

  **GATE 2: THE "MISSING OVERLAY" TRAP (High Priority - BREAK)**
  * If one frame has a date/text overlay and the other does not, you **MUST** output "same_scene: false".
  * *Reasoning:* "Overlay Inconsistency." (STOP AND RETURN)

  **GATE 3: THE "SAME DATE" OVERRIDE (High Priority - KEEP)**
  * If both frames have a date overlay and the date strings are **IDENTICAL**, you **MUST** output "same_scene: true".
  * **NOTE:** This overrides visual changes. If the date is exactly the same, it is the same scene. (STOP AND RETURN)

  **GATE 4: VISUAL ANALYSIS (Only if No Dates are Present)**
  * If (and ONLY if) there are NO text overlays on either frame:
      * **BREAK** if the location/setting changes to a completely new different location/setting.
      * **BREAK** if one frame is solid color/static/artifact.
      * **KEEP** if the setting is even remotely the same, or differences like movement, blur, camera flashes, or if overall unsure based on analysis.

  ---
  ### REQUIRED JSON OUTPUT SCHEMA

  Respond **ONLY** with a single valid JSON object.

  {
  "frame_a_date_text": "[Exact transcription of date text on Frame A. Write 'NONE' if no text.]",
  "frame_b_date_text": "[Exact transcription of date text on Frame B. Write 'NONE' if no text.]",
  "frame_a_summary": "[Summary of visual content in Frame A]",
  "frame_b_summary": "[Summary of visual content in Frame B]",
  "reasoning": "[Brief explanation. If dates differ, explicitly state: 'Date changed from X to Y'.]",
  "same_scene": [boolean: true or false]
  }
```

@joshuabolick commented on GitHub (Dec 10, 2025):

Also attached here are those images if you want to use them for testing; let me know if there is any other info or anything else I can do to help! Mainly it is very curious that when we switch over to calling the Google API directly with the same model, this does not happen...

Sorry, I had to remove those images, but I can provide other examples if needed; just let me know. Thank you!


@joshuabolick commented on GitHub (Dec 29, 2025):

Hey @rick-github, hope you are doing well; I just wanted to follow up.

We are still seeing this happen sometimes, even with the image tags in the prompt: when sending two images to describe and then compare in the Ollama request, sometimes it still gives the exact same description for both images and then says they are the same.

So from our testing so far, this does seem like some sort of image bleedover. Any other ideas or things I can try and test here? Or do you think I should go ahead and open a new issue to have it dealt with?

Please let me know when you get a chance and thanks again!

Reference: github-starred/ollama#31247