[PR #6393] [CLOSED] Paligemma Support #12099

Closed
opened 2026-04-12 23:49:33 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/6393
Author: @royjhan
Created: 8/16/2024
Status: Closed

Base: main ← Head: paligemma-support


📝 Commits (6)

  • 7de230f paligemma patch
  • c631633 paligemma demo works
  • e6802df fixed patches, llava
  • a33e56c uses input prompt
  • 80eef7c changes
  • a6d30ec working causal attention

📊 Changes

3 files changed (+186 additions, -4 deletions)


📝 llm/ext_server/server.cpp (+71 -2)
📝 llm/patches/06-embeddings.diff (+2 -2)
➕ llm/patches/12-paligemma.diff (+113 -0)

📄 Description

This PR implements Paligemma support within Ollama using GGML. Paligemma (https://huggingface.co/collections/google/paligemma-release-6643a9ffbf57de2ae0448dda) is a one-shot image-text-to-text model from Google.

The main features of this PR are pre-processing of image-text prompts for Paligemma, non-causal attention during prompt processing (a rough illustration of this attention pattern follows), and merging of image features into the input features within llama.cpp for combined decoding. The PR modifies /examples/llava to adapt to the missing projector in Paligemma. No changes were made to clip.cpp for image embeddings.
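
As a rough illustration of the non-causal (prefix-LM) attention pattern mentioned above, here is a minimal sketch of the kind of mask involved. It is illustrative only and not the patch's actual masking code; the function name, additive-mask convention, and n_prefix parameter are assumptions.

#include <cmath>
#include <vector>

// Additive attention mask: 0.0f where token i may attend to token j,
// -INFINITY where it may not. For a Paligemma-style prefix LM, the first
// n_prefix tokens (image placeholders + <bos> + prompt text) attend to each
// other bidirectionally, while any later tokens remain causal.
static std::vector<float> build_prefix_lm_mask(int n_tokens, int n_prefix) {
    std::vector<float> mask(static_cast<size_t>(n_tokens) * n_tokens, -INFINITY);
    for (int i = 0; i < n_tokens; i++) {
        for (int j = 0; j < n_tokens; j++) {
            const bool in_prefix = i < n_prefix && j < n_prefix;
            if (in_prefix || j <= i) {
                mask[static_cast<size_t>(i) * n_tokens + j] = 0.0f;
            }
        }
    }
    return mask;
}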

  • server.cpp
    server.cpp now checks for images and chooses a prompt-processing path based on the model architecture. Paligemma uses image placeholder tokens <image> as substitutes for the image embeddings prepended to the text prompt. The input to the model should be in the format <image>...<image><bos> + text + \n (see the prompt-construction sketch after this list). The text, with the placeholders, is then embedded. The image embeddings previously retrieved from the image encoder are included as part of the model context and are then swapped in for the placeholder embeddings while the compute graph is built.

  • llama.cpp
    The patch to llama.cpp embeds the entire text input, including the placeholders. The data for the image placeholder tokens is then exchanged with the corresponding image embeddings passed in through the model context (see the embedding-swap sketch after this list). Checks are also added to ensure that logits are reserved for non-causal decodes when the decode includes images.
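
The prompt construction described in the server.cpp bullet looks roughly like the following sketch (illustrative only, not the actual server.cpp code; the function name and the num_image_tokens parameter are made up, and the placeholder count depends on the model variant, e.g. 256 image tokens for the -224 models):

#include <string>

// Build the Paligemma input: one <image> placeholder per image embedding row,
// then <bos>, the user text, and a trailing newline.
static std::string build_paligemma_prompt(const std::string & user_text,
                                          int num_image_tokens) {
    std::string prompt;
    for (int i = 0; i < num_image_tokens; i++) {
        prompt += "<image>";
    }
    prompt += "<bos>";
    prompt += user_text;
    prompt += "\n";
    return prompt;
}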
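
The embedding swap described in the llama.cpp bullet can be pictured with this sketch (again illustrative; the flat row-major layout, the image_token_id argument, and all names are assumptions, not the patch itself):

#include <cstring>
#include <vector>

// Overwrite the rows of the input embedding matrix that correspond to <image>
// placeholder tokens with the precomputed image embeddings from the vision
// encoder. inp_embd is n_tokens x n_embd in row-major order; image_embd holds
// one n_embd-sized row per image placeholder, in order.
static void splice_image_embeddings(std::vector<float> & inp_embd,
                                    const std::vector<float> & image_embd,
                                    const std::vector<int> & tokens,
                                    int image_token_id,
                                    int n_embd) {
    size_t img_row = 0;
    for (size_t i = 0; i < tokens.size(); i++) {
        if (tokens[i] == image_token_id &&
            (img_row + 1) * n_embd <= image_embd.size()) {
            std::memcpy(inp_embd.data() + i * n_embd,
                        image_embd.data() + img_row * n_embd,
                        n_embd * sizeof(float));
            img_row++;
        }
    }
}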

USAGE:

To run Paligemma, you first need to create GGUF files for both the vision encoder and the language model. This can be done using the surgery script here: https://gist.github.com/joshyan1/d5eb3e58fd51680fcba9b1d87f8b3ebf. Then create a Modelfile referencing both GGUF files, without a template, to run.
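
A Modelfile along these lines should be enough (a sketch only: the filenames are placeholders, and the two-FROM layout for pairing the language-model GGUF with the vision-encoder GGUF is an assumption, since the text above only says to reference both files and omit a template):

# Hypothetical Modelfile: language model plus vision encoder, no TEMPLATE directive
FROM ./paligemma-text.gguf
FROM ./paligemma-vision.gguf

You could then build and run it with something like ollama create my-paligemma -f Modelfile followed by ollama run my-paligemma.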

Alternatively, pull this model directly from ollama (https://ollama.com/) using ollama pull jyan1/paligemma-mix-224. Take a look at the model at https://ollama.com/jyan1/paligemma-mix-224.

Build and run this PR

If you do not have a clone of this repository already

git clone https://github.com/ollama/ollama.git

Build and serve

cd ollama
git fetch -a
git checkout paligemma-support
go generate ./...
go build .
./ollama serve

You can now query Paligemma either from the CLI or via an HTTP request; an example of each follows.

CLI Example

Using another terminal window

./ollama run jyan1/paligemma-mix-224

Input

>>> What is in this image? /path/to/my/puppy.jpg

Output

Added image '/path/to/my/puppy.jpg'
A brown dog wearing a floral shirt and lei stands proudly next to a clear blue 
pool. The dog's mouth is open, its paw rests on the edge of the water, and its 
eyes are focused on the horizon. The pool water is crystal clear, and the palm 
trees in the distance provide shade for the dog. A black leash connects the dog 
to its owner, and a flower lei is around the dog's neck. The dog's fur is brown, 
and its nose is black. The tree behind the pool is tall and slender, and the 
fence surrounding the pool is made of metal posts.

/path/to/my/puppy.jpg for reference (https://github.com/user-attachments/assets/2f8e4cc5-ad5c-4f29-b74e-5d34fb8d8a98) :)
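
HTTP Example

As noted above, you can also query the model over Ollama's HTTP API. A request of roughly this shape should work (the images field takes base64-encoded image data; the value below is a placeholder, not literal text):

curl http://localhost:11434/api/generate -d '{
  "model": "jyan1/paligemma-mix-224",
  "prompt": "What is in this image?",
  "images": ["<base64-encoded contents of puppy.jpg>"],
  "stream": false
}'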

This PR adds support for a one-shot, single-image image-text prompt to Ollama. The supported models are the Paligemma variants with the -224 suffix, which refers to the rescaled image size they use.

TODO:

  • Include multi-image prompt processing, which involves appending additional image tokens to the start of the text input and including an array of image embeddings in the model context
  • Use the model configuration to append the right number of placeholder tokens in server.cpp and to set up the correct dimensions
  • Ensure that the prompt is not truncated due to batch size, since Paligemma wants the entire input, including all text and images, to be provided as a single batch. Note that for Paligemma-...-448 and Paligemma-...-896 models there are 1024 and 4096 image tokens per image, respectively, which is already larger than the batch_size set in llama.cpp

Thanks everyone,
Josh and Roy


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
