[PR #5246] [MERGED] llm: speed up gguf decoding by a lot #11723

opened 2026-04-12 23:37:09 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/5246
Author: @bmizerany
Created: 6/24/2024
Status: Merged
Merged: 6/25/2024
Merged by: @bmizerany

Base: main ← Head: bmizerany/apishowperf


📝 Commits (1)

  • 280d632 llm: speed up gguf decoding by a lot

📊 Changes

13 files changed (+263 additions, -69 deletions)

View changed files

📝 llm/ggla.go (+11 -2)
📝 llm/ggml.go (+18 -7)
llm/ggml_test.go (+1 -0)
📝 llm/gguf.go (+92 -38)
📝 llm/memory_test.go (+11 -8)
📝 llm/server.go (+8 -3)
📝 server/images.go (+1 -1)
📝 server/model.go (+3 -3)
📝 server/routes.go (+16 -3)
📝 server/sched.go (+1 -1)
📝 server/sched_test.go (+3 -3)
util/bufioutil/buffer_seeker.go (+34 -0)
util/bufioutil/buffer_seeker_test.go (+64 -0)

📄 Description

Previously, loading GGUF files and their metadata and tensor information
was very slow because of some costly operations:

  • Too many allocations when decoding strings
  • Hitting disk for each read of each key and value, resulting in an
    excessive number of syscalls and a lot of disk I/O.

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro
M3.

This commit also lets callers opt out of collecting large arrays of
values when decoding GGUFs. When such keys are encountered, their values
are set to null and are encoded as such in JSON.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-12 23:37:09 -05:00

Reference: github-starred/ollama#11723