[PR #14916] [CLOSED] mlx: add multimodal pipeline infrastructure for vision and audio models #61612

Closed
opened 2026-04-29 16:40:45 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14916
Author: @dhiltgen
Created: 3/17/2026
Status: Closed

Base: main ← Head: mlx-multimodal-pipeline


📝 Commits (1)

  • 81ebbd5 mlx: add multimodal pipeline infrastructure for vision and audio models

📊 Changes

4 files changed (+183 additions, -3 deletions)


📝 x/mlxrunner/client.go (+9 -0)
📝 x/mlxrunner/model/base/base.go (+60 -0)
📝 x/mlxrunner/pipeline.go (+112 -2)
📝 x/mlxrunner/runner.go (+2 -1)

📄 Description

Add a modality-agnostic multimodal framework to the MLX runner:

  • MultimodalModel interface: extends Model with EncodeMultimodal() for preprocessing raw bytes (images, audio) and Prefill() for multimodal embedding construction and chunked forward passes.
  • Pipeline support: parse [img-N] placeholder tags in prompts, route multimodal inputs through model-specific encoding, and handle multimodal prefill with KV cache awareness.
  • Client plumbing: forward image/audio data from completion requests through to the runner pipeline.
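
The MultimodalModel interface described above can be sketched roughly as follows. The method signatures and the stub `toyModel` are illustrative assumptions; the PR summary does not show the actual code in x/mlxrunner/model/base/base.go:

```go
package main

import "fmt"

// Model stands in for the existing base model interface; its real method
// set lives in x/mlxrunner/model/base and is not shown in this summary.
type Model interface {
	Forward(tokens []int32) ([]float32, error)
}

// MultimodalModel extends Model per the PR description: EncodeMultimodal
// preprocesses raw bytes (images, audio) into model-specific inputs, and
// Prefill builds the multimodal embedding sequence and runs chunked
// forward passes. Signatures here are hypothetical.
type MultimodalModel interface {
	Model
	EncodeMultimodal(data []byte) (any, error)
	Prefill(inputs []any) error
}

// toyModel is a stub showing how a vision model might satisfy the interface.
type toyModel struct{}

func (toyModel) Forward(tokens []int32) ([]float32, error) {
	return make([]float32, len(tokens)), nil
}

func (toyModel) EncodeMultimodal(data []byte) (any, error) {
	// A real model would decode the image/audio bytes and return embeddings;
	// the stub just reports how many bytes it received.
	return len(data), nil
}

func (toyModel) Prefill(inputs []any) error { return nil }

func main() {
	var m MultimodalModel = toyModel{}
	enc, _ := m.EncodeMultimodal([]byte{1, 2, 3})
	fmt.Println(enc)
}
```

Embedding Model inside MultimodalModel lets the runner keep a single model handle and type-assert to the multimodal interface only when the request actually carries image or audio data.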
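
The [img-N] placeholder parsing could look roughly like the sketch below. The regex and the helper name `splitPrompt` are assumptions for illustration, not the actual pipeline.go code:

```go
package main

import (
	"fmt"
	"regexp"
)

// imgTag matches [img-N] placeholders, capturing the numeric index N.
var imgTag = regexp.MustCompile(`\[img-(\d+)\]`)

// splitPrompt returns the text segments surrounding each placeholder and
// the image indices referenced between them, so the pipeline can interleave
// text token embeddings with encoded multimodal embeddings.
func splitPrompt(prompt string) (segments []string, indices []string) {
	last := 0
	for _, m := range imgTag.FindAllStringSubmatchIndex(prompt, -1) {
		segments = append(segments, prompt[last:m[0]])
		indices = append(indices, prompt[m[2]:m[3]])
		last = m[1]
	}
	segments = append(segments, prompt[last:])
	return
}

func main() {
	segs, idxs := splitPrompt("Describe [img-0] and [img-1].")
	fmt.Println(segs, idxs)
}
```

A prompt with K placeholders yields K+1 text segments (some possibly empty), which is the natural shape for building the interleaved embedding sequence during multimodal prefill.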

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


Reference: github-starred/ollama#61612