[PR #7453] [MERGED] runner.go: Don't set cross attention before sending embeddings #38299

Closed
opened 2026-04-22 22:58:20 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/7453
Author: @jessegross
Created: 10/31/2024
Status: Merged
Merged: 10/31/2024
Merged by: @jessegross

Base: main ← Head: jessegross/cross_attn


📝 Commits (1)

  • 42e5133 runner.go: Don't set cross attention before sending embeddings

📊 Changes

2 files changed (+23 additions, -9 deletions)

View changed files

📝 llama/runner/image.go (+11 -0)
📝 llama/runner/runner.go (+12 -9)

📄 Description

Currently, if an input contains embeddings at any point, we set cross attention to true from the beginning. This means that any tokens processed before the embeddings are sent will incorrectly have cross attention layers applied.

This change only sets cross attention once we have an embedding, either earlier in this sequence or in the cache. It also makes cross attention capable of supporting parallelism at the runner level, though the mllama implementation doesn't support that yet.
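For illustration only, here is a minimal Go sketch of the decision this change describes: keep cross attention off while only text tokens have been processed, and latch it on once an embedding has been seen in the sequence (or restored from the cache). The type and identifier names (Input, Sequence, crossAttention, batchHasEmbedding) are hypothetical and are not taken from the PR's actual code.

```go
// Hypothetical sketch, not the actual ollama runner implementation.
package main

import "fmt"

// Input is either a text token or a multimodal embedding.
type Input struct {
	Token     int
	Embedding []float32 // non-nil when this input carries an image embedding
}

// Sequence tracks per-sequence state across batches.
type Sequence struct {
	// crossAttention remembers that an embedding has already been sent,
	// either earlier in this sequence or via the prompt cache.
	crossAttention bool
}

// batchHasEmbedding reports whether any input in the current batch is an embedding.
func batchHasEmbedding(batch []Input) bool {
	for _, in := range batch {
		if in.Embedding != nil {
			return true
		}
	}
	return false
}

func main() {
	seq := &Sequence{}

	// Old behavior (the bug): cross attention was enabled for the whole
	// request as soon as *any* input contained an embedding, so text tokens
	// decoded before the embedding also went through cross attention layers.
	//
	// New behavior: enable cross attention only after an embedding is seen,
	// and keep it enabled for the rest of the sequence.
	batches := [][]Input{
		{{Token: 1}, {Token: 2}},           // text only: cross attention stays off
		{{Embedding: []float32{0.1, 0.2}}}, // image embedding: turn it on
		{{Token: 3}},                       // later text: stays on
	}

	for i, batch := range batches {
		if batchHasEmbedding(batch) {
			seq.crossAttention = true
		}
		fmt.Printf("batch %d: crossAttention=%v\n", i, seq.crossAttention)
	}
}
```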


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-22 22:58:20 -05:00