[GH-ISSUE #9513] Granite-Vision: uploading a second image on the CLI is ignored and only the first one is processed #52713

Open
opened 2026-04-29 00:37:03 -05:00 by GiteaMirror · 1 comment

Originally created by @sammyf on GitHub (Mar 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9513

What is the issue?

On the CLI, if you upload an image, granite-vision will analyse it. If you then upload a second image without running /clear first, it ignores the second image and just analyses the first one again.

Apparently this only happens with granite-vision (see the comparison test with moondream in the screenshots).
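
A rough sketch of the CLI session being described (model tag, file paths, and responses here are placeholders, not taken from the screenshots):

```
$ ollama run granite3.2-vision
>>> describe this image: ./first.jpg
(correct description of first.jpg)
>>> describe this image: ./second.jpg
(bug: the model describes first.jpg again; second.jpg is only analysed if /clear is run first)
```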

![Image](https://github.com/user-attachments/assets/24f5e6a5-fa62-4984-86ad-db58cb5ec948)
![Image](https://github.com/user-attachments/assets/bbb13498-3db0-4b1e-bc58-390143991b2f)

Relevant log output

(none provided)
OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.13

GiteaMirror added the bug label 2026-04-29 00:37:03 -05:00

@jesus-talavera-ibm commented on GitHub (Mar 30, 2026):

This issue has been fixed.
The root cause was a bug in llama.cpp's multimodal tokenization (mtmd) where text chunks interspersed with image embeddings (specifically template delimiter tokens appearing between images) were not being correctly translated to text tokens. This caused the model to fail to properly distinguish images across conversation turns.

The fix was introduced in ollama/ollama#12552 (commit 4987f13d, Oct 2025) as part of a llama.cpp bump:

fix(mtmd): Correctly encode text chunks during mtmd tokenization
There can be text chunks that appear interspersed with the image embeddings that contain template delimiter tokens for some models. These need to be correctly translated to text tokens.
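
To make the described root cause concrete, here is a toy Python sketch. This is illustrative only, not llama.cpp's actual code: the chunk representation, function name, and token values are all invented for demonstration. The buggy path drops the delimiter text chunks that separate image embeddings, so consecutive images run together; the fixed path translates them to text tokens as the commit describes.

```python
def tokenize_chunks(chunks, include_text_chunks):
    """Flatten a multimodal chunk list into a token/embedding sequence.

    chunks: list of ("text", [token, ...]) or ("image", embedding_id) pairs.
    With include_text_chunks=False (mimicking the bug), text chunks that sit
    between image embeddings, e.g. template delimiter tokens, are dropped,
    so the model cannot tell where one image ends and the next begins.
    """
    out = []
    for kind, payload in chunks:
        if kind == "image":
            out.append(("img", payload))
        elif include_text_chunks:
            # The fix: text chunks are translated to text tokens as well.
            out.extend(("tok", t) for t in payload)
    return out

# Two images separated by a (made-up) template delimiter between turns.
chunks = [
    ("image", 0),
    ("text", [101, 102]),  # stand-in for delimiter tokens between images
    ("image", 1),
]

buggy = tokenize_chunks(chunks, include_text_chunks=False)
fixed = tokenize_chunks(chunks, include_text_chunks=True)
assert buggy == [("img", 0), ("img", 1)]  # delimiters silently lost
assert fixed == [("img", 0), ("tok", 101), ("tok", 102), ("img", 1)]
```

In the buggy sequence the two image embeddings are adjacent with no separating tokens, which matches the reported symptom of the model failing to distinguish images across conversation turns.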

We verified that multi-turn image conversations with granite3.2-vision:2b work correctly on ollama v0.19.0. The model correctly identifies different images across 2+ turns without needing /clear.

If you're still experiencing this issue, please upgrade to ollama v0.19.0 or later.

Reference: github-starred/ollama#52713