[PR #401] [MERGED] subprocess llama.cpp server #15408

Closed
opened 2026-04-16 04:58:24 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/401
Author: @BruceMacD
Created: 8/23/2023
Status: Merged
Merged: 8/30/2023
Merged by: @BruceMacD

Base: main ← Head: brucemacd/server-shell


📝 Commits (10+)

  • e2cb384 prototype
  • 13364ad remove c code
  • cbfcf55 pack llama.cpp
  • d370f5b fix params
  • a3aec93 use request context for llama_cpp
  • e37f2e8 lora
  • 1bb681c let llama_cpp decide the number of threads to use
  • d34da41 stop llama runner when app stops
  • a215aab multiple runners
  • 71a68d4 restore prompt num keep

📊 Changes

37 files changed (+958 additions, -43928 deletions)


📝 .gitignore (+0 -1)
➕ .gitmodules (+3 -0)
📝 api/types.go (+7 -18)
📝 app/src/index.ts (+1 -1)
📝 docs/development.md (+7 -5)
📝 go.mod (+2 -2)
📝 go.sum (+3 -2)
➖ llm/ggml-alloc.c (+0 -575)
➖ llm/ggml-alloc.h (+0 -48)
➖ llm/ggml-cuda.cu (+0 -6497)
➖ llm/ggml-cuda.h (+0 -63)
➖ llm/ggml-metal.h (+0 -106)
➖ llm/ggml-metal.m (+0 -1180)
➖ llm/ggml-metal.metal (+0 -2000)
➖ llm/ggml-mpi.c (+0 -244)
➖ llm/ggml-mpi.h (+0 -67)
➖ llm/ggml-opencl.cpp (+0 -1893)
➖ llm/ggml-opencl.h (+0 -53)
➖ llm/ggml.c (+0 -18722)
➖ llm/ggml.h (+0 -1780)

...and 17 more files

📄 Description

This is a pretty big change that moves llama.cpp from a library linked in via cgo to an external server process that we manage.
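In concrete terms, the runner becomes a child process speaking HTTP. Here is a minimal sketch of that shape — not the PR's actual code: `startRunner` and its polling loop are illustrative, and the `--model`/`--port` flags assume the upstream llama.cpp server CLI.

```go
package llm

import (
	"context"
	"fmt"
	"net"
	"os/exec"
	"time"
)

// startRunner launches a llama.cpp server binary and blocks until its
// port accepts connections or the context is cancelled.
// (Illustrative sketch; names and flags are assumptions.)
func startRunner(ctx context.Context, binPath, modelPath string, port int) (*exec.Cmd, error) {
	// CommandContext kills the child process if ctx is cancelled.
	cmd := exec.CommandContext(ctx, binPath,
		"--model", modelPath,
		"--port", fmt.Sprintf("%d", port),
	)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	addr := fmt.Sprintf("127.0.0.1:%d", port)
	for {
		// Poll until the server's TCP port is open.
		if conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond); err == nil {
			conn.Close()
			return cmd, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```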

Why?

  • This makes building for multiple platforms easier (no more Windows cgo incompatibilities)
  • We can fall back to non-GPU runners when needed
  • ~200ms faster on average in my tests
  • Way less code in our repo
  • Maybe easier to manage our build matrix

Minor Breaking Changes

  • The generate response no longer includes sample count or sample duration. These metrics are not included in the response from the llama.cpp server.
  • Only one LoRA adapter is supported at a time, since the llama.cpp server isn't built for multiple adapters at the moment. Allowing multiple looks like a simple PR to open against llama.cpp (see the sketch after this list).
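For illustration, this is roughly how the single-adapter constraint surfaces when building the server's command line. A minimal sketch: `serverArgs` is a hypothetical helper, and it assumes llama.cpp's `--lora` flag, which takes one adapter path.

```go
package llm

import "errors"

// serverArgs builds the llama.cpp server argument list (hypothetical
// helper; flag names assume the upstream llama.cpp server CLI).
func serverArgs(modelPath string, adapters []string) ([]string, error) {
	args := []string{"--model", modelPath}
	switch {
	case len(adapters) > 1:
		// The server accepts a single --lora flag, so reject extras.
		return nil, errors.New("only one LoRA adapter is supported")
	case len(adapters) == 1:
		args = append(args, "--lora", adapters[0])
	}
	return args, nil
}
```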

Features

  • Uses the existing loading logic to manage a llama.cpp server
  • Packages the llama.cpp CPU and GPU runtimes into the Go binary (see the sketch after this list)
  • Removes vendored llama.cpp code
  • No more cgo
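One way to picture the packaging step is Go's `embed` facility. This is a hedged sketch, not necessarily the mechanism the PR uses; the embedded paths and the `extractRunner` helper are illustrative only.

```go
package llm

import (
	"embed"
	"os"
	"path/filepath"
)

// Embed the prebuilt llama.cpp server binaries, one per build variant
// (illustrative layout, e.g. cpu/ and metal/ build directories).
//go:embed llama.cpp/build/*/bin/server
var runners embed.FS

// extractRunner writes the embedded binary for the given variant to a
// temp directory and returns its path so it can be exec'd.
func extractRunner(variant string) (string, error) {
	data, err := runners.ReadFile("llama.cpp/build/" + variant + "/bin/server")
	if err != nil {
		return "", err
	}
	dir, err := os.MkdirTemp("", "ollama-runner")
	if err != nil {
		return "", err
	}
	path := filepath.Join(dir, "server")
	if err := os.WriteFile(path, data, 0o755); err != nil {
		return "", err
	}
	return path, nil
}
```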

There are a lot of changes in this PR; here are the key files to look at:

  • llm/llama.go
  • llm/llama_generate.go
  • llm/llama_generate_darwin.go
  • api/types.go
  • app/src/index.ts
  • server/routes.go

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 04:58:24 -05:00
Reference: github-starred/ollama#15408