[GH-ISSUE #8513] Support for Multiple Images in /chat Endpoint #31247

Closed
opened 2026-04-22 11:31:02 -05:00 by GiteaMirror · 18 comments

Originally created by @pmedina-42 on GitHub (Jan 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8513

Currently, the /chat endpoint includes the images field, but it only supports a single image. While this is functional, it adds a layer of complexity when performing RAG with base64-encoded images.

For instance, if the content retriever returns multiple top-scoring embeddings that reference different images, we have to manually reconstruct the full images (potentially missing lower-scoring records when a single record isn't large enough to hold the whole image) and then make a separate /chat call for each retrieved image. Finally, all the responses must be summarized into one.

This manual process could be significantly simplified if the images field allowed for passing multiple images in a single request.
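
For illustration, the desired request shape, with multiple images in a single message, might look like the following Python sketch (model and file names are placeholders; this assumes a local Ollama server on the default port):

```python
# Hypothetical multi-image /api/chat request (a sketch of the requested
# behavior): several base64 images attached to one message.
import base64
import requests

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llava",  # placeholder vision model
        "messages": [{
            "role": "user",
            "content": "describe the animals shown in the images",
            "images": [b64("image1.jpg"), b64("image2.jpg")],  # multiple images
        }],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```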

Is there any plan in the near future to support multiple images in the images field? This enhancement would greatly streamline workflows and reduce the overhead in scenarios like the one described above.

Additionally, I’m just getting started and don’t have much experience yet, so it’s possible that I’m overlooking something that could make this process easier. If there’s a better approach or workaround I might have missed, I’d be grateful for any guidance.

Thank you in advance!

GiteaMirror added the feature request label 2026-04-22 11:31:02 -05:00

@rick-github commented on GitHub (Jan 21, 2025):

ollama supports multiple images, but most models do not. Note how llava merges the two images and describes a kitten with the collar from the puppy image.

```console
$ for i in minicpm-v:8b-2.6-q4_K_M moondream:1.8b-v2-fp16 llava ; do 
  echo $i ; 
  echo '{"model": "'$i'",
         "messages":[{
            "role":"user","content":"describe the animals shown in the images",
            "images": [
              "'"$(base64 puppy.jpg)"'",
              "'"$(base64 kitten.jpg)"'"
            ]
          }],
         "stream":false}' | curl -s http://localhost:11434/api/chat -d @- | jq -r .message.content ;
done

#minicpm-v:8b-2.6-q4_K_M
The first image shows a small white puppy sitting on what appears to be concrete steps. The puppy
has bright eyes and is wearing a red collar with a bell attached.

The second image depicts an orange kitten lying down, looking directly at the camera. It has large,
expressive greenish-yellow eyes and pointed ears typical of many cat breeds.

#moondream:1.8b-v2-fp16
In the image, there is a cute orange kitten sitting on top of an object that appears to be either a couch
or a bed. The kitten seems to be looking directly at the camera and has its eyes open wide, capturing
attention from viewers. The scene exudes warmth and playfulness as this adorable feline takes center
stage in the composition.

#llava
The image shows a small, kitten-like cat with a white coat and tan-colored ears. It has striking blue
eyes and is wearing a red collar with a tag. The cat appears to be sitting or lying down on what looks
like a wooden floor or deck.
```

@joshuabolick commented on GitHub (Dec 8, 2025):

Greetings all, we have been struggling with this too. We are using the Gemma3:12b model, sending two images in the request to the generate endpoint via the Python API, and it looks very much like the two images are being merged when we try to compare them.

Is it definitely true that this cannot be done with current Ollama using the Gemma3:12b model?

We have also been doing this by calling out directly to the Google API with the same model and the same request, and it works without a problem, so this has been very puzzling.

Please let me know and thank you!


@rick-github commented on GitHub (Dec 8, 2025):

```console
$ echo '{
  "model": "gemma3:12b",
  "messages":[{
    "role":"user","content":"describe the animals shown in the images",
    "images": [
      "'"$(base64 image1.jpg)"'",
      "'"$(base64 image2.jpg)"'"
    ]
  }],
  "stream":false
}' | curl -s http://localhost:11434/api/chat -d @- | jq -r .message.content
Here's a description of the animals in the images:

**Image 1: White Puppy**

*   **Type:** It appears to be a very young puppy, likely a mix. Given its fluffy white coat, it could have some Spitz breed in its heritage (like an American Eskimo Dog or Samoyed, but it's difficult to be certain).
*   **Appearance:** The puppy is small and fluffy, with a pristine white coat. It has a small, slightly upturned nose and dark eyes. It's wearing a red collar with a gold bell.
*   **Pose:** The puppy is sitting on a set of concrete steps, looking off to the side with a slightly melancholy expression.

**Image 2: Ginger Kitten**

*   **Type:** It is a kitten.
*   **Appearance:** The kitten has a vibrant ginger (orange/red) coat, with white markings around its face. It has large, expressive eyes, and its ears are large in proportion to its head.
*   **Pose:** The kitten is perched on a textured surface and looking directly into the camera.



Let me know if you're curious about anything else regarding these images!
```

@joshuabolick commented on GitHub (Dec 8, 2025):

Okay, thanks. So it sounds like it should work with this model then?

We are using the generate endpoint; should that work the same way?


@rick-github commented on GitHub (Dec 8, 2025):

```console
$ echo '{
  "model": "gemma3:12b",
  "prompt":"describe the animals shown in the images",
  "images": [
    "'"$(base64 image1.jpg)"'",
    "'"$(base64 image2.jpg)"'"
  ],
  "stream":false
}' | curl -s http://localhost:11434/api/generate -d @- | jq -r .response
Here's a description of the animals in the images:

**Image 1: The Puppy**

*   **Type:** The puppy appears to be a Samoyed or a breed with similar characteristics.
*   **Appearance:** It's a fluffy, snow-white puppy. It has a cute, slightly worried expression and small, dark eyes. It's wearing a red collar with a bell.
*   **Pose:** It's sitting on what looks like a stone step.

**Image 2: The Kitten**

*   **Type:** It's an orange tabby kitten.
*   **Appearance:** The kitten is a beautiful orange tabby with a fluffy appearance. It has a sweet expression and large, curious eyes.
*   **Pose:** The kitten is sitting or crouching on a textured surface that appears to be part of a chair or cushion.



Let me know if you want me to describe anything else!
```

@joshuabolick commented on GitHub (Dec 9, 2025):

Thank you so much for your help!

So for our usage, we are using the Python ollama library (just installed the latest, v0.6.1) and also updated to the very latest version of Ollama (0.13.2), running on our Ubuntu server, which has 2 NVIDIA GPUs that appear to be balancing the request load correctly.

Anyway, I am manually testing this on our server with just 2 images, and you can see our prompt below (I know it's long; we are still adjusting it). It feels like the base64-encoded images or the requests may be getting cached somehow. I have been searching and found various ways to try to disable caching, which you can see in the request options, but so far I am still getting the same summary for both Frame A and Frame B (I have attached these images; Frame A is a dog and Frame B is a bird). I have also confirmed at least a couple of times that the images we are sending are correct, and even decoded and saved them again afterwards to verify; from everything I can tell, we are sending the right images.

![Image](https://github.com/user-attachments/assets/ffe6e2f9-6917-4be1-a0d7-78ef24d6bfb0)
![Image](https://github.com/user-attachments/assets/6ea58831-3d67-4726-ad19-0bbec4e6f456)

Also, for our use case we are analyzing a video at certain frame intervals, iterating through the frames to compare them, so we are making these requests anywhere from 50 to 200+ times per run.

When testing locally on my MacBook (it runs slowly but still works), if I tried enough times it would occasionally give me the Frame B summary of the bird, but most of the time both the Frame A and Frame B summaries described the dog, as shown in the logs below.

Please let me know; I am not sure if I am doing something wrong, if I have found some other issue, or whether there is a way to disable the cache so this always produces the correct summaries for both images. Thank you so much for your help!

Below is the Python code showing how we are sending the request, along with the logged response:

Python code:
<img width="895" height="888" alt="Image" src="https://github.com/user-attachments/assets/a42e9866-4d45-46c9-b5ab-0d565e45daf9" />

Logged response:

2025-12-09 11:42:19,989 - root - INFO - frame_pair: ['frame_000001.jpg', 'frame_000002.jpg']
2025-12-09 11:42:19,989 - root - INFO - model: gemma3:12b
2025-12-09 11:42:19,989 - root - INFO - frame1_path: ../scene_analysis_output/frames/frame_000001.jpg
2025-12-09 11:42:19,989 - root - INFO - frame2_path: ../scene_analysis_output/frames/frame_000002.jpg
2025-12-09 11:42:19,990 - root - INFO - prompt_text=

SYSTEM INSTRUCTION: SCENE GROUPING & OCR VALIDATOR

ROLE: You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A and Frame B) and decide if they belong to the same scene group.

CRITICAL DIRECTIVE: You must act as a strict OCR (Optical Character Recognition) Validator. The text overlay (specifically the DATE) is the absolute source of truth. Visual similarity of the room or characters is IRRELEVANT if the date text changes.

OUTPUT FORMAT: You MUST output ONLY a single, valid JSON object. No markdown formatting outside the code block.

TASK: FRAME COMPARISON LOGIC

STEP 1: TEXT EXTRACTION (CRITICAL)
First, transcribe any date/time overlay text found on Frame A and Frame B into the JSON fields provided.

STEP 2: VISUAL SUMMARIZATION
Briefly summarize the visual content (objects, setting, action) in the summary fields.

STEP 3: LOGIC GATES (Execute in Order)

GATE 1: THE "DIFFERENT DATE" TRAP (Highest Priority - BREAK)

  • Compare the extracted "frame_a_date_text" and "frame_b_date_text".
  • If the text exists on both but the DAY number is different (e.g., "Dec 01" vs "Dec 02"), you MUST output "same_scene: false".
  • Reasoning: "Date Mismatch." (STOP AND RETURN)

GATE 2: THE "MISSING OVERLAY" TRAP (High Priority - BREAK)

  • If one frame has a date/text overlay and the other does not, you MUST output "same_scene: false".
  • Reasoning: "Overlay Inconsistency." (STOP AND RETURN)

GATE 3: THE "SAME DATE" OVERRIDE (High Priority - KEEP)

  • If both frames have a date overlay and the date strings are IDENTICAL, you MUST output "same_scene: true".
  • NOTE: This overrides visual changes. If the date is exactly the same, it is the same scene. (STOP AND RETURN)

GATE 4: VISUAL ANALYSIS (Only if No Dates are Present)

  • If (and ONLY if) there are NO text overlays on either frame:
    • BREAK if the location/setting changes to a completely new different location/setting.
    • BREAK if one frame is solid color/static/artifact.
    • KEEP if the setting is even remotely the same, or differences like movement, blur, camera flashes, or if overall unsure based on analysis.

REQUIRED JSON OUTPUT SCHEMA

Respond ONLY with a single valid JSON object.

{
"frame_a_date_text": "[Exact transcription of date text on Frame A. Write 'NONE' if no text.]",
"frame_b_date_text": "[Exact transcription of date text on Frame B. Write 'NONE' if no text.]",
"frame_a_summary": "[Summary of visual content in Frame A]",
"frame_b_summary": "[Summary of visual content in Frame B]",
"reasoning": "[Brief explanation. If dates differ, explicitly state: 'Date changed from X to Y'.]",
"same_scene": [boolean: true or false]
}

2025-12-09 11:42:19,995 - httpcore.connection - DEBUG - connect_tcp.started host='127.0.0.1' port=11434 local_address=None timeout=None socket_options=None
2025-12-09 11:42:19,995 - httpcore.connection - DEBUG - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x7effd3170730>
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - send_request_headers.started request=<Request [b'POST']>
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - send_request_headers.complete
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - send_request_body.started request=<Request [b'POST']>
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - send_request_body.complete
2025-12-09 11:42:19,995 - httpcore.http11 - DEBUG - receive_response_headers.started request=<Request [b'POST']>
2025-12-09 11:42:27,207 - httpcore.http11 - DEBUG - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Type', b'application/json; charset=utf-8'), (b'Date', b'Tue, 09 Dec 2025 18:42:27 GMT'), (b'Transfer-Encoding', b'chunked')])
2025-12-09 11:42:27,208 - httpx - INFO - HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
2025-12-09 11:42:27,208 - httpcore.http11 - DEBUG - receive_response_body.started request=<Request [b'POST']>
2025-12-09 11:42:27,208 - httpcore.http11 - DEBUG - receive_response_body.complete
2025-12-09 11:42:27,208 - httpcore.http11 - DEBUG - response_closed.started
2025-12-09 11:42:27,208 - httpcore.http11 - DEBUG - response_closed.complete
2025-12-09 11:42:27,208 - root - INFO - response=```json
{
"frame_a_date_text": "NONE",
"frame_b_date_text": "NONE",
"frame_a_summary": "Close-up of a small, tan and white dog with floppy ears and a blue and white patterned collar, standing on a wooden deck.",
"frame_b_summary": "Close-up of a small, tan and white dog with floppy ears and a blue and white patterned collar, standing on a wooden deck.",
"reasoning": "No date text found on either frame.",
"same_scene": true
}


@joshuabolick commented on GitHub (Dec 9, 2025):

Oh yes, one other thing to add: our system can also use an API key to call out directly to the Google API using the same Gemma3:12b model, and when we do it that way the image summaries always appear to be correct. The code leading up to that call is pretty much the same, except that for Google you upload the images rather than encoding them for the Ollama request...

Please let me know and thanks again so much for your help!


@rick-github commented on GitHub (Dec 9, 2025):

It's much easier to test things if you supply the raw text. For example, I assume that the text starting with "SYSTEM INSTRUCTION" is the value of `PROMPT_TEXT`, but because you've pasted the text in without a markdown block, it's being rendered by the browser and I don't know if detail is being lost. Also, screenshotting the python code means I can't cut and paste to test it.


@joshuabolick commented on GitHub (Dec 9, 2025):

Thanks, sorry about that; here it is:

```
frames_prompt_text: |
  -----
  ### SYSTEM INSTRUCTION: SCENE GROUPING & OCR VALIDATOR

  **ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A and Frame B) and decide if they belong to the same scene group.

  **CRITICAL DIRECTIVE:** You must act as a strict **OCR (Optical Character Recognition) Validator**. The text overlay (specifically the DATE) is the absolute source of truth. Visual similarity of the room or characters is IRRELEVANT if the date text changes.

  **OUTPUT FORMAT:** You MUST output ONLY a single, valid JSON object. No markdown formatting outside the code block.
  ---
  ### TASK: FRAME COMPARISON LOGIC

  **STEP 1: TEXT EXTRACTION (CRITICAL)**
  First, transcribe any date/time overlay text found on Frame A and Frame B into the JSON fields provided.

  **STEP 2: VISUAL SUMMARIZATION**
  Briefly summarize the visual content (objects, setting, action) in the summary fields.

  **STEP 3: LOGIC GATES (Execute in Order)**

  **GATE 1: THE "DIFFERENT DATE" TRAP (Highest Priority - BREAK)**
  * Compare the extracted "frame_a_date_text" and "frame_b_date_text".
  * If the text exists on both but the **DAY number** is different (e.g., "Dec 01" vs "Dec 02"), you **MUST** output "same_scene: false".
  * *Reasoning:* "Date Mismatch." (STOP AND RETURN)

  **GATE 2: THE "MISSING OVERLAY" TRAP (High Priority - BREAK)**
  * If one frame has a date/text overlay and the other does not, you **MUST** output "same_scene: false".
  * *Reasoning:* "Overlay Inconsistency." (STOP AND RETURN)

  **GATE 3: THE "SAME DATE" OVERRIDE (High Priority - KEEP)**
  * If both frames have a date overlay and the date strings are **IDENTICAL**, you **MUST** output "same_scene: true".
  * **NOTE:** This overrides visual changes. If the date is exactly the same, it is the same scene. (STOP AND RETURN)

  **GATE 4: VISUAL ANALYSIS (Only if No Dates are Present)**
  * If (and ONLY if) there are NO text overlays on either frame:
      * **BREAK** if the location/setting changes to a completely new different location/setting.
      * **BREAK** if one frame is solid color/static/artifact.
      * **KEEP** if the setting is even remotely the same, or differences like movement, blur, camera flashes, or if overall unsure based on analysis.

  ---
  ### REQUIRED JSON OUTPUT SCHEMA

  Respond **ONLY** with a single valid JSON object.

  {
  "frame_a_date_text": "[Exact transcription of date text on Frame A. Write 'NONE' if no text.]",
  "frame_b_date_text": "[Exact transcription of date text on Frame B. Write 'NONE' if no text.]",
  "frame_a_summary": "[Summary of visual content in Frame A]",
  "frame_b_summary": "[Summary of visual content in Frame B]",
  "reasoning": "[Brief explanation. If dates differ, explicitly state: 'Date changed from X to Y'.]",
  "same_scene": [boolean: true or false]
  }
```

Also here is that Python code:

```python
# Function to encode an image to base64
def encode_image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def call_ollama_api(frame_pair, output_folder):
    """
    Calls the local Ollama server to analyze a pair of frame images with the
    Gemma3 AI model

    Args:
        frame_pair: Pair of frame image filenames to compare
        output_folder: Output folder containing the frames/ subdirectory

    Returns:
        AI model frame image analysis response text
    """
    configuration_settings = get_config()

    # Initialize the Ollama client
    client = ollama.Client(configuration_settings.get(OLLAMA_URL))
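    # Note: this client instance is not used below; the ollama.generate() call
    # further down goes through the module-level default client instead.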

    logging.info(f"frame_pair: {frame_pair}")
    logging.info(f"model: {configuration_settings.get(OLLAMA_MODEL_NAME)}")
    frame1_path = output_folder + "/frames/" + frame_pair[0]
    frame2_path = output_folder + "/frames/" + frame_pair[1]

    logging.info(f"frame1_path: {frame1_path}")
    logging.info(f"frame2_path: {frame2_path}")

    # Encode images to base64
    FrameA = encode_image_to_base64(frame1_path)
    FrameB = encode_image_to_base64(frame2_path)

    prompt = configuration_settings.get(PROMPT_TEXT)
    # prompt = "describe the animals shown in the images"
    logging.info(f"prompt={prompt}")

    # Make the generate request with the images
    try:
        response = ollama.generate(
            model=configuration_settings.get(OLLAMA_MODEL_NAME),
            prompt=prompt,
            images=[FrameA, FrameB],
            options={
                'num_ctx': 0, # This guarantees the request is processed without memory of previous interactions
                'keep_alive': '0s', # Use a string literal that specifies time unit (e.g., '0s' for 0 seconds)
            }
        )
        logging.info(f"response={response['response']}")

        return response['response']
    except Exception as e:
        logging.error(f"An error occurred: {e}")
        return f"An error occurred: {e}"
```

Thank you again so much for your help!


@rick-github commented on GitHub (Dec 9, 2025):

Internally, the ollama server just tokenizes the images and prepends them to the start of the prompt. They don't have any identifiers, so the model can be confused about which image is being asked about when the prompt targets a specific image. Adding `img` tags can help the model disambiguate image references:

```diff
--- prompt_text.orig	2025-12-09 22:47:28.947539440 +0100
+++ prompt_text	2025-12-09 22:44:38.379639080 +0100
@@ -1,6 +1,6 @@
 ### SYSTEM INSTRUCTION: SCENE GROUPING & OCR VALIDATOR
 
-**ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A and Frame B) and decide if they belong to the same scene group.
+**ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A [img-0] and Frame B [img-1]) and decide if they belong to the same scene group.
 
 **CRITICAL DIRECTIVE:** You must act as a strict **OCR (Optical Character Recognition) Validator**. The text overlay (specifically the DATE) is the absolute source of truth. Visual similarity of the room or characters is IRRELEVANT if the date text changes.
```
 

> 'num_ctx': 0, # This guarantees the request is processed without memory of previous interactions

This assertion is incorrect. All this does is reduce the size of the context to the minimum supported by the model, 2048 tokens in the case of image models. Ollama does cache prompts but does not maintain a memory of previous interactions. The cache is invalidated at the point where the prompt deviates from the previous prompt. Unloading the model will also do this, but:

> 'keep_alive': '0s', # Use a string literal that specifies time unit (e.g., '0s' for 0 seconds)

`keep_alive` is not a generation option, it is a runner option:

```python
response = ollama.generate(
    model=configuration_settings.get(OLLAMA_MODEL_NAME),
    prompt=prompt,
    images=[FrameA, FrameB],
    keep_alive='0s',  # Use a string literal that specifies a time unit (e.g., '0s' for 0 seconds)
)
```
````console
$ ./8513.py 
```json
{
"frame_a_date_text": "NONE",
"frame_b_date_text": "NONE",
"frame_a_summary": "Frame A depicts a small, fluffy dog with brown and white fur, a pink nose, and a blue patterned collar. The dog is looking directly at the camera on a light-colored wooden deck.",
"frame_b_summary": "Frame B shows a colorful bird perched on a branch. The bird has blue and red plumage and a green back. The background is blurred.",
"reasoning": "No date text present on either frame. Location/setting is completely different (dog on deck vs bird on a branch).",
"same_scene": false
}
```
````
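
Putting the two corrections together, a minimal end-to-end sketch (file and model names are placeholders) might look like:

```python
# Sketch combining the fixes discussed above: [img-N] tags in the prompt to
# disambiguate image references, and keep_alive passed as a top-level argument
# rather than inside options.
import base64
import ollama

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "Compare the two video frames in this request "
    "(Frame A [img-0] and Frame B [img-1]) and describe each in one sentence."
)

response = ollama.generate(
    model="gemma3:12b",
    prompt=prompt,
    images=[b64("frame_a.jpg"), b64("frame_b.jpg")],
    keep_alive="0s",  # runner option: unload the model after this request
)
print(response["response"])
```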

@joshuabolick commented on GitHub (Dec 9, 2025):

Okay thank you very much!

So where you put the keep_alive there inside of the generate call is correct then?

Also do you think it was the image tags in the prompt or moving the keep_alive to the proper place or both that got us the accurate result?

Thank you again for your help I very much appreciate it!!


@joshuabolick commented on GitHub (Dec 9, 2025):

Also just to confirm, when you say "The cache will be invalidated at the point the prompt deviates from the previous prompt." I am assuming this means the prompt text alone and having new images in the request will not invalidate the cache right?

If we are using the same prompt for each of these image pair comparisons will this cause us issues or?


@rick-github commented on GitHub (Dec 9, 2025):

> So where you put the keep_alive there inside of the generate call is correct then?

Yes.

> Also do you think it was the image tags in the prompt or moving the keep_alive to the proper place or both that got us the accurate result?

Adding the image tags.

> I am assuming this means the prompt text alone and having new images in the request will not invalidate the cache right?

Images are prepended to the prompt, so a new image will invalidate the entire prompt cache.

> If we are using the same prompt for each of these image pair comparisons will this cause us issues or?

No issues.


@rick-github commented on GitHub (Dec 9, 2025):

Let me correct myself on the image/prompt interaction. If the client is always sending the same number of images with the same prompt, then the prompt cache will be used. The images themselves are a set of tokens that are processed in a separate batch and are not subject to prompt caching. So while the text prompt will be re-used, there should be no issues with image bleedover. If there is, that would be a bug, and you should open a new issue to have it dealt with.
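
If you want to check for bleedover empirically, one approach is to send the same pair of frames in both orders and see whether the per-frame descriptions swap accordingly (a sketch; file and model names are placeholders):

```python
# Bleedover check (sketch): if the model reads both images correctly, the
# descriptions should swap when the image order swaps.
import base64
import ollama

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

PROMPT = (
    "Frame A is [img-0] and Frame B is [img-1]. "
    "In one sentence each, describe Frame A and then Frame B."
)

a, b = b64("frame_a.jpg"), b64("frame_b.jpg")
for order, images in [("A,B", [a, b]), ("B,A", [b, a])]:
    result = ollama.generate(model="gemma3:12b", prompt=PROMPT, images=images)
    print(f"--- image order {order} ---\n{result['response']}\n")
```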


@joshuabolick commented on GitHub (Dec 9, 2025):

Thank you again @rick-github this has been extremely helpful!

If you have a buy me a coffee thing or something I will buy you at least a couple! haha thanks again cheers! :)


@joshuabolick commented on GitHub (Dec 10, 2025):

Hey @rick-github, I just wanted to follow up. Making that change to add the tags for the images definitely improved the results, but I am still seeing the problem sometimes, especially since we are iterating through frame images from a video.

Also, one thing I forgot to mention about our comparisons: we compare, for example, frame 100 to frame 200 as one pair, then on the next iteration we compare frame 200 to frame 300, and so on. So on the next iteration we send that frame 200 again, now as img-0 where it was img-1 in the previous iteration. I am not sure if that could somehow be causing this issue?

Attached here are a couple of image comparisons where you can see it seems to be using the same image for both, and also sometimes it will miss the date text overlay, which is definitely important for our process.

Anyway, I just wanted to mention that detail about how we are making these requests, and also to ask whether there is anything else I can try to prevent it from sometimes using the wrong images in the comparisons. On other iterations it works correctly, so it is not always using the wrong images, but from what I can tell it sometimes is.

Please let me know what you think and thanks again for all your help!

```
frames_prompt_text: |
  -----
  ### SYSTEM INSTRUCTION: SCENE GROUPING & OCR VALIDATOR

  **ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A [img-0] and Frame B [img-1]) and decide if they belong to the same scene group.

  **CRITICAL DIRECTIVE:** You must act as a strict **OCR (Optical Character Recognition) Validator**. The text overlay (specifically the DATE) is the absolute source of truth. Visual similarity of the room or characters is IRRELEVANT if the date text changes.

  **OUTPUT FORMAT:** You MUST output ONLY a single, valid JSON object. No markdown formatting outside the code block.
  ---
  ### TASK: FRAME COMPARISON LOGIC

  **STEP 1: TEXT EXTRACTION (CRITICAL)**
  First, transcribe any date/time overlay text found on Frame A and Frame B into the JSON fields provided.

  **STEP 2: VISUAL SUMMARIZATION**
  Briefly summarize the visual content (objects, setting, action) in the summary fields.

  **STEP 3: LOGIC GATES (Execute in Order)**

  **GATE 1: THE "DIFFERENT DATE" TRAP (Highest Priority - BREAK)**
  * Compare the extracted "frame_a_date_text" and "frame_b_date_text".
  * If the text exists on both but the **DAY number** is different (e.g., "Dec 01" vs "Dec 02"), you **MUST** output "same_scene: false".
  * *Reasoning:* "Date Mismatch." (STOP AND RETURN)

  **GATE 2: THE "MISSING OVERLAY" TRAP (High Priority - BREAK)**
  * If one frame has a date/text overlay and the other does not, you **MUST** output "same_scene: false".
  * *Reasoning:* "Overlay Inconsistency." (STOP AND RETURN)

  **GATE 3: THE "SAME DATE" OVERRIDE (High Priority - KEEP)**
  * If both frames have a date overlay and the date strings are **IDENTICAL**, you **MUST** output "same_scene: true".
  * **NOTE:** This overrides visual changes. If the date is exactly the same, it is the same scene. (STOP AND RETURN)

  **GATE 4: VISUAL ANALYSIS (Only if No Dates are Present)**
  * If (and ONLY if) there are NO text overlays on either frame:
      * **BREAK** if the location/setting changes to a completely new different location/setting.
      * **BREAK** if one frame is solid color/static/artifact.
      * **KEEP** if the setting is even remotely the same, or differences like movement, blur, camera flashes, or if overall unsure based on analysis.

  ---
  ### REQUIRED JSON OUTPUT SCHEMA

  Respond **ONLY** with a single valid JSON object.

  {
  "frame_a_date_text": "[Exact transcription of date text on Frame A. Write 'NONE' if no text.]",
  "frame_b_date_text": "[Exact transcription of date text on Frame B. Write 'NONE' if no text.]",
  "frame_a_summary": "[Summary of visual content in Frame A]",
  "frame_b_summary": "[Summary of visual content in Frame B]",
  "reasoning": "[Brief explanation. If dates differ, explicitly state: 'Date changed from X to Y'.]",
  "same_scene": [boolean: true or false]
  }
```

@joshuabolick commented on GitHub (Dec 10, 2025):

Also attached here are those images if you want to use them for testing; let me know if there is any other info or anything else I can do to help! Mainly it is very curious that when we switch over to calling the Google API directly with the same model, this does not happen...

Sorry, I had to remove those images, but I can provide other examples if needed; just let me know. Thank you!


@joshuabolick commented on GitHub (Dec 29, 2025):

Hey @rick-github, hope you are doing well; I just wanted to follow up.

We are still seeing this happen sometimes, even with the image tags in the prompt: when sending two images to describe and then compare in the Ollama request, sometimes it still gives the exact same description for both images and then says they are the same.

So from our testing so far, this does seem like some sort of image bleedover. Any other ideas or things I can try and test here? Or do you think I should go ahead and open a new issue to have it dealt with?

Please let me know when you get a chance and thanks again!

Reference: github-starred/ollama#31247