[PR #11959] [MERGED] disable output_all #13669

Opened 2026-04-13 00:32:23 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/11959
Author: @mxyng
Created: 8/19/2025
Status: Merged
Merged: 8/19/2025
Merged by: @mxyng

Base: main ← Head: mxyng/disable-output-all


📝 Commits (1)

c5481f5 disable output_all

📊 Changes

3 files changed (+25 additions, -3 deletions)

View changed files

📝 llama/llama.cpp/src/llama-context.cpp (+1 -2)
📝 llama/patches/0019-Enable-CUDA-Graphs-for-gemma3n.patch (+1 -1)
➕ llama/patches/0023-decode-disable-output_all.patch (+23 -0)

📄 Description

Explicitly disable `output_all`, since we use `cparams.embeddings` slightly differently than intended. With `output_all=true`, the hidden states are not truncated to just the last position of each sequence, which blocks use of `ggml_cuda_mul_mat_vec_q` for inputs > 8. Decoding instead falls back to `ggml_cuda_op_mul_mat_cublas`, which allocates temporary buffers to hold the dequantized tensors. This is problematic when the quantized tensor is `token_embd.weight`, which for models such as gemma2:2b allocates >2GB.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 00:32:23 -05:00
Reference: github-starred/ollama#13669