[GH-ISSUE #13586] gemma3:12b seems to suffer from image bleed. #8944

Open
opened 2026-04-12 21:45:57 -05:00 by GiteaMirror · 8 comments
Originally created by @rick-github on GitHub (Dec 30, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13586

What is the issue?

@joshuabolick writes in https://github.com/ollama/ollama/issues/8513#issuecomment-3628884515:

Greetings all. We have been struggling with sending multiple images to Ollama. We are using the gemma3:12b model, sending two images per request to the generate endpoint via the Python API, and it very much appears that the model is merging the two images when we try to compare them.

The prompt has been modified from the original to include image tags (`[img-0]`) to help the model distinguish the two images, as discussed in https://github.com/ollama/ollama/issues/8513#issuecomment-3634463919:

```
frames_prompt_text: |
  -----
  ### SYSTEM INSTRUCTION: SCENE GROUPING & OCR VALIDATOR

  **ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame images in this request (Frame A [img-0] and Frame B [img-1]) and decide if they belong to the same scene group.

  **CRITICAL DIRECTIVE:** You must act as a strict **OCR (Optical Character Recognition) Validator**. The text overlay (specifically the DATE) is the absolute source of truth. Visual similarity of the room or characters is IRRELEVANT if the date text changes.

  **OUTPUT FORMAT:** You MUST output ONLY a single, valid JSON object. No markdown formatting outside the code block.
  ---
  ### TASK: FRAME COMPARISON LOGIC

  **STEP 1: TEXT EXTRACTION (CRITICAL)**
  First, transcribe any date/time overlay text found on Frame A and Frame B into the JSON fields provided.

  **STEP 2: VISUAL SUMMARIZATION**
  Briefly summarize the visual content (objects, setting, action) in the summary fields.

  **STEP 3: LOGIC GATES (Execute in Order)**

  **GATE 1: THE "DIFFERENT DATE" TRAP (Highest Priority - BREAK)**
  * Compare the extracted "frame_a_date_text" and "frame_b_date_text".
  * If the text exists on both but the **DAY number** is different (e.g., "Dec 01" vs "Dec 02"), you **MUST** output "same_scene: false".
  * *Reasoning:* "Date Mismatch." (STOP AND RETURN)

  **GATE 2: THE "MISSING OVERLAY" TRAP (High Priority - BREAK)**
  * If one frame has a date/text overlay and the other does not, you **MUST** output "same_scene: false".
  * *Reasoning:* "Overlay Inconsistency." (STOP AND RETURN)

  **GATE 3: THE "SAME DATE" OVERRIDE (High Priority - KEEP)**
  * If both frames have a date overlay and the date strings are **IDENTICAL**, you **MUST** output "same_scene: true".
  * **NOTE:** This overrides visual changes. If the date is exactly the same, it is the same scene. (STOP AND RETURN)

  **GATE 4: VISUAL ANALYSIS (Only if No Dates are Present)**
  * If (and ONLY if) there are NO text overlays on either frame:
      * **BREAK** if the location/setting changes to a completely new different location/setting.
      * **BREAK** if one frame is solid color/static/artifact.
      * **KEEP** if the setting is even remotely the same, or differences like movement, blur, camera flashes, or if overall unsure based on analysis.

  ---
  ### REQUIRED JSON OUTPUT SCHEMA

  Respond **ONLY** with a single valid JSON object.

  {
  "frame_a_date_text": "[Exact transcription of date text on Frame A. Write 'NONE' if no text.]",
  "frame_b_date_text": "[Exact transcription of date text on Frame B. Write 'NONE' if no text.]",
  "frame_a_summary": "[Summary of visual content in Frame A]",
  "frame_b_summary": "[Summary of visual content in Frame B]",
  "reasoning": "[Brief explanation. If dates differ, explicitly state: 'Date changed from X to Y'.]",
  "same_scene": [boolean: true or false]
  }
```

The code:

```python
#!/usr/bin/env python3

import base64
import logging

import ollama

logging.basicConfig(level=logging.INFO)

# configuration keys
OLLAMA_MODEL_NAME = "OLLAMA_MODEL_NAME"
PROMPT_TEXT = "PROMPT_TEXT"
OLLAMA_URL = "OLLAMA_URL"


def get_config():
    with open("prompt_text", "r") as f:
        prompt = f.read()
    return {
        OLLAMA_MODEL_NAME: "gemma3:12b",
        PROMPT_TEXT: prompt,
        OLLAMA_URL: "http://localhost:11434",
    }


def encode_image_to_base64(image_path):
    """Encode an image file to a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def call_ollama_api(frame_pair, output_folder):
    """
    Call the local Ollama server to analyze a pair of frame images with Gemma3.

    Args:
        frame_pair: Pair of frame image filenames to compare.
        output_folder: Folder containing the frames/ subdirectory.

    Returns:
        The model's frame-comparison response text.
    """
    configuration_settings = get_config()

    # Initialize the Ollama client with the configured URL
    client = ollama.Client(host=configuration_settings.get(OLLAMA_URL))

    logging.info(f"frame_pair: {frame_pair}")
    logging.info(f"model: {configuration_settings.get(OLLAMA_MODEL_NAME)}")
    frame1_path = output_folder + "/frames/" + frame_pair[0]
    frame2_path = output_folder + "/frames/" + frame_pair[1]

    logging.info(f"frame1_path: {frame1_path}")
    logging.info(f"frame2_path: {frame2_path}")

    # Encode images to base64
    frame_a = encode_image_to_base64(frame1_path)
    frame_b = encode_image_to_base64(frame2_path)

    prompt = configuration_settings.get(PROMPT_TEXT)
    logging.info(f"prompt={prompt}")

    # Make the generate request with the images.
    # (Note: the original report called module-level ollama.generate(), which
    # ignores the client configured above and always uses the default URL.)
    try:
        response = client.generate(
            model=configuration_settings.get(OLLAMA_MODEL_NAME),
            prompt=prompt,
            images=[frame_a, frame_b],
            keep_alive='0s',  # string with a time unit; '0s' unloads the model immediately
        )
        logging.info(f"response={response['response']}")
        return response['response']
    except Exception as e:
        logging.error(f"An error occurred: {e}")
        return f"An error occurred: {e}"


if __name__ == "__main__":
    r = call_ollama_api(["frame1.jpg", "frame2.jpg"], ".")
    print(r)
```

Making that change to add the tags for the images definitely improved the results, but I am still seeing the problem sometimes, especially since we are iterating through frame images from a video.

Also, one thing I forgot to mention about our comparisons: we compare, for example, frame 100 to frame 200 as one pair, then on the next iteration frame 200 to frame 300, and so on. That means on the next iteration we send frame 200 again, now as img-0 where it was img-1 in the previous iteration, and I am not sure whether that could be causing this issue.

Attached are a couple of image comparisons where it seems to be using the same image for both frames; sometimes it also misses the date text overlay, which is important for our process.

Anyway, I wanted to mention that detail about how we make these requests, and to ask if there is anything else I can try to stop it from sometimes using the wrong images in the comparisons. On other iterations it works correctly, so it is not always using the wrong images, but from what I can tell it sometimes is.

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

0.13.2

GiteaMirror added the bug label 2026-04-12 21:45:57 -05:00

@rick-github commented on GitHub (Dec 30, 2025):

@joshuabolick Set `OLLAMA_DEBUG=2` in the server environment and post the resulting log. Note that this will include the prompt so be aware of PII.


@joshuabolick commented on GitHub (Dec 30, 2025):

> @joshuabolick Set `OLLAMA_DEBUG=2` in the server environment and post the resulting log. Note that this will include the prompt so be aware of PII.

Hey @rick-github, sorry for the delay, but I collected a bunch of example data along with the Ollama debug logs. The log for the Example 1 video is a lot larger because it was a longer video and because it includes all of the Ollama start-up info (everything is in the zip files at the Google Drive link in the comment below).

Also, to add: we have not changed the prompt. If you look at the step3_ai_scene_analysis.html file in the html_results folder for each example, it shows each image comparison. Sometimes it works, but often it does not, and you can see where the image descriptions for Frame A or Frame B are incorrect and describe the other frame being compared.

Lastly, I am not certain this is only a problem with the gemma3:12b model. I am planning to do some further testing with other models and can provide that result data as well.

Please let me know any other questions I can clarify or other ways I can help and looking forward to hearing what you find!

Thanks again! Examples below and attached:

Example 1:
*(9 screenshot attachments showing comparison results)*

Example 2:
*(4 screenshot attachments)*

Example 3:
*(5 screenshot attachments)*

Example 4:
*(5 screenshot attachments)*

Example 5:
*(6 screenshot attachments)*

Example 6:
*(6 screenshot attachments)*

Example 7:
*(6 screenshot attachments)*


@joshuabolick commented on GitHub (Jan 12, 2026):

A zip file with each example with html reports and the images as well can be downloaded at this link:

https://drive.google.com/drive/folders/1hqgjQX1eBTEzspnYChprAqKAKFG9n486?usp=sharing

Please let me know if you run into any trouble downloading or accessing this data, or if there is anything else I can help with. Thanks again!


@joshuabolick commented on GitHub (Jan 13, 2026):

Also, I wanted to add that this is the Python code and the current settings we use when making the request calls to the Ollama service:

```python
response = client.generate(
    model='gemma3:12b',
    prompt=prompt,
    images=[FrameA, FrameB],
    stream=False,  # Python booleans are capitalized
    keep_alive='0s',  # string literal with a time unit (e.g., '0s' for 0 seconds)
    options={
        'temperature': 0.3,
        'top_p': 0.3,
        'top_k': 15,
        # note: 1.0 disables the repeat penalty, and values below 1.0
        # actually encourage repetition, so 0.0 is likely unintended
        'repeat_penalty': 0.0,
    },
)
```

@JoannaWHN commented on GitHub (Jan 20, 2026):

I have also encountered a similar problem. What I am hoping for is that, in the future, multiple images and text can be interleaved in a single input.
I noticed that the notebook provided by Google builds its chat messages this way:

```python
content = []
content.append({"type": "text", "text": instruction})
for slice_number, ct_slice in enumerate(normalized_ct_volume_slices, 1):
    content.append({"type": "image", "image": _encode(ct_slice)})
    content.append({"type": "text", "text": f"SLICE {slice_number}"})
content.append({"type": "text", "text": query_text})

messages = [
    {
        "role": "user",
        "content": content,
    }
]
```

The resulting messages look like this:

```
[
    {
        "content": [
            {
                "text": "You are an instructor teaching medical students. You are analyzing a contiguous block of CT slices f...",
                "type": "text"
            },
            {
                "image": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aH...",
                "type": "image"
            },
            {
                "text": "SLICE 1",
                "type": "text"
            },
            {
                "image": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aH...",
                "type": "image"
            },
            {
                "text": "SLICE 2",
                "type": "text"
            },
            {
                "image": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aH...",
                "type": "image"
            },
            {
                "text": "SLICE 3",
                "type": "text"
            },
            {
                "text": "\n\nBased on the visual evidence in the slices provided above, is this image a good teaching example o...",
                "type": "text"
            }
        ],
        "role": "user"
    }
]
```
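Ollama's chat endpoint does not currently accept mixed content parts like the Google notebook format above; instead, each chat message carries an `images` list alongside its text. One rough way to approximate interleaving with the ollama-python client is to send one user message per image, so each image sits next to its own label. This is a sketch under that assumption; the model name, labels, and placeholder strings are illustrative, not from this thread:

```python
# Sketch: approximating interleaved image/text input with Ollama's chat API
# by splitting the content into one user message per image.
def build_messages(instruction, image_b64_list, query_text):
    """Build chat messages: instruction, then one labeled message per image,
    then the final query. Each message's image rides in its 'images' field."""
    messages = [{"role": "user", "content": instruction}]
    for i, img in enumerate(image_b64_list, 1):
        # the label text and its image travel together in one message
        messages.append({"role": "user", "content": f"SLICE {i}", "images": [img]})
    messages.append({"role": "user", "content": query_text})
    return messages


if __name__ == "__main__":
    # requires a running Ollama server; model and inputs are illustrative
    import ollama

    msgs = build_messages(
        "You are analyzing video frames.",
        ["<base64-frame-a>", "<base64-frame-b>"],
        "Do these frames belong to the same scene?",
    )
    resp = ollama.chat(model="gemma3:12b", messages=msgs)
    print(resp["message"]["content"])
```

Whether this actually prevents the bleed would need testing; it mainly keeps each image unambiguously paired with its label.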

@joshuabolick commented on GitHub (Jan 27, 2026):

Hello @rick-github, hope all is great with you!

I just wanted to check in. Are there any updates on this, or other changes to the request, the prompt, or anything else we could try or test out?

Please let me know when you get a chance and thanks again!


@rick-github commented on GitHub (Feb 11, 2026):

I was having another look at this today and I realized I had led you slightly astray when I suggested the initial prompt change:

```
**ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame
images in this request (Frame A [img-0] and Frame B [img-1]) and decide if they belong to the same
scene group.
```

The ollama server does sequential substitution of `[img]` tags, so the tags should not have the trailing numbers:

```
**ROLE:** You are a specialized video analysis system. Your task is to compare the two video frame
images in this request (Frame A [img] and Frame B [img]) and decide if they belong to the same
scene group.
```

This will properly interleave the images and should hopefully give better results.
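To make the corrected tag usage concrete, here is a minimal sketch of a generate call where each bare `[img]` tag is replaced, in order, by the next entry in the `images` list. The model name and frame paths are illustrative, and `b64` is a hypothetical helper, not part of the reporter's code:

```python
# Sketch: bare [img] tags in the prompt are substituted sequentially
# from the images list by the Ollama server.
import base64


def b64(path):
    """Read an image file and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# note: [img], not [img-0]/[img-1]
PROMPT = (
    "Compare the two video frames in this request "
    "(Frame A [img] and Frame B [img]) and decide if they "
    "belong to the same scene group."
)

if __name__ == "__main__":
    # requires a running Ollama server; paths and model are illustrative
    import ollama

    response = ollama.generate(
        model="gemma3:12b",
        prompt=PROMPT,
        # the first [img] receives frame_a.jpg, the second receives frame_b.jpg
        images=[b64("frame_a.jpg"), b64("frame_b.jpg")],
    )
    print(response["response"])
```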


@joshuabolick commented on GitHub (Feb 17, 2026):

Okay, thank you @rick-github! I am going to give this a try today and will let you know how it goes and provide some new test result data.

Reference: github-starred/ollama#8944