[PR #9557] [MERGED] ollamarunner: Improve multimodal input handling #75293

Closed
opened 2026-05-05 07:43:56 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/9557
Author: @jessegross
Created: 3/6/2025
Status: Merged
Merged: 3/7/2025
Merged by: @jessegross

Base: main ← Head: jessegross/mulitmodal


📝 Commits (2)

  • d771a00 model: Don't unconditionally add special tokens
  • 2948c0a ollamarunner: Improve multimodal input handling

📊 Changes

8 files changed (+254 additions, -137 deletions)

View changed files

📝 llm/server.go (+1 -1)
📝 model/model.go (+63 -7)
📝 model/models/mllama/model.go (+72 -29)
📝 model/process_text.go (+3 -3)
📝 model/process_text_test.go (+7 -7)
📝 runner/ollamarunner/cache.go (+8 -12)
📝 runner/ollamarunner/cache_test.go (+41 -33)
📝 runner/ollamarunner/runner.go (+59 -45)

📄 Description

Various vision models have different requirements for how they receive their inputs. For example:

  • Mllama expects images to arrive together with text; the image embeddings themselves have no positions and are not stored in the main KV cache.
  • Llava-style models feed in image embeddings much like tokens, and each image corresponds to a varying number of tokens in the cache.

In addition, the strategy for providing inputs must support batching and multiple sequences, which are managed by the runner. At the same time, we want to keep data handling fully in the model so that new architectures are not bottlenecked by runner code which does not understand their particular requirements.

This provides a method for models to rewrite the input stream into a form that meets their needs while remaining in a format the runner understands. As a result, the runner needs no model-specific special cases.

In addition, this fixes a regression where non-vision models could incorrectly attempt to interpret image inputs.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 07:43:56 -05:00

Reference: github-starred/ollama#75293