[PR #13752] [CLOSED] feat: Add per-GPU VRAM overhead configuration #14374

Closed
opened 2026-04-13 00:52:13 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13752
Author: @Mikec78660
Created: 1/16/2026
Status: Closed

Base: main ← Head: feature/per-gpu-vram-overhead


📝 Commits (3)

  • eafc0c0 feat: Add per-GPU VRAM overhead configuration
  • 33c628c feat: enhance ollama ps to show layer distribution per GPU
  • 771d0b2 feat: enhance ollama ps to show layer distribution per GPU

📊 Changes

10 files changed (+737 additions, -49 deletions)

View changed files

📝 api/types.go (+42 -8)
📝 cmd/cmd.go (+21 -2)
📝 envconfig/config.go (+133 -2)
📝 envconfig/config_test.go (+98 -0)
📝 llm/server.go (+145 -13)
📝 ml/backend/ggml/ggml.go (+46 -5)
📝 runner/llamarunner/runner.go (+43 -6)
📝 server/routes.go (+28 -7)
📝 server/sched.go (+177 -6)
📝 x/imagegen/server.go (+4 -0)

📄 Description

Add per-GPU VRAM overhead configuration to allow different memory reservations per GPU.

Problem

Users with multiple GPUs (e.g., different performance tiers like RTX 3090 + Tesla P40) need to configure different overhead values to optimize layer distribution. Currently OLLAMA_GPU_OVERHEAD applies to all GPUs globally.

Solution

  • Parse OLLAMA_GPU_OVERHEAD in the format device:bytes,... (e.g., 0:6GB,1:0)
  • Map user-specified device indices (0, 1, ...) to actual GPU UUIDs
  • Fallback to direct UUID matching for explicit device identification
  • Apply overhead per-GPU in layer allocation logic
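The parsing step above could look roughly like the following Go sketch. The names parseBytes and gpuOverheadMap come from the PR's list of changes to envconfig/config.go, but the bodies here are illustrative guesses, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBytes converts a human-readable size such as "6GB" or "512MB"
// into a byte count; a bare integer is taken as raw bytes.
// (Hypothetical sketch of the helper the PR adds.)
func parseBytes(s string) (uint64, error) {
	s = strings.ToUpper(strings.TrimSpace(s))
	for _, u := range []struct {
		suffix string
		factor uint64
	}{{"GB", 1 << 30}, {"MB", 1 << 20}, {"KB", 1 << 10}} {
		if strings.HasSuffix(s, u.suffix) {
			n, err := strconv.ParseUint(strings.TrimSuffix(s, u.suffix), 10, 64)
			return n * u.factor, err
		}
	}
	return strconv.ParseUint(s, 10, 64)
}

// gpuOverheadMap parses the "device:bytes,..." form into a map keyed
// by device identifier (an index like "0" or a UUID like "GPU-...").
// GPU UUIDs contain hyphens, not colons, so splitting on the first
// ':' is unambiguous.
func gpuOverheadMap(env string) (map[string]uint64, error) {
	out := make(map[string]uint64)
	for _, entry := range strings.Split(env, ",") {
		dev, size, ok := strings.Cut(entry, ":")
		if !ok {
			return nil, fmt.Errorf("invalid entry %q", entry)
		}
		n, err := parseBytes(size)
		if err != nil {
			return nil, err
		}
		out[strings.TrimSpace(dev)] = n
	}
	return out, nil
}

func main() {
	m, _ := gpuOverheadMap("0:6GB,1:0")
	fmt.Println(m["0"], m["1"]) // 6442450944 0
}
```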

Usage

```bash
# Reserve 6GB on first GPU, none on second
export OLLAMA_GPU_OVERHEAD="0:6GB,1:0"

# Or use UUIDs for explicit identification
export OLLAMA_GPU_OVERHEAD="GPU-d7786e1e-...:6GB,GPU-e6985af3-...:0"
```

Testing

  • Unit tests: go test ./envconfig/... -v ✓
  • Manual testing with CUDA GPUs ✓

Changes

  • envconfig/config.go: Add parseBytes() and GpuOverheadMap() functions
  • envconfig/config_test.go: Add unit tests for new functionality
  • llm/server.go: Apply per-GPU overhead in layer allocation
  • server/sched.go: Log per-GPU overhead values
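The index-to-UUID resolution with direct-UUID fallback, and the subtraction of overhead before layer allocation, could be sketched as below. The gpuInfo type and both function names are hypothetical stand-ins; the real logic lives in llm/server.go against ollama's GPU discovery structures:

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuInfo is a minimal stand-in for the discovered-GPU structure.
type gpuInfo struct {
	ID         string // UUID, e.g. "GPU-d7786e1e-..."
	FreeMemory uint64
}

// overheadFor resolves the configured overhead for one GPU: it first
// tries the GPU's position in discovery order (index key), then falls
// back to matching the UUID directly.
func overheadFor(overheads map[string]uint64, gpus []gpuInfo, i int) uint64 {
	if v, ok := overheads[strconv.Itoa(i)]; ok {
		return v
	}
	if v, ok := overheads[gpus[i].ID]; ok {
		return v
	}
	return 0
}

// usableMemory subtracts the per-GPU overhead before layer allocation,
// clamping at zero so an oversized reservation never underflows.
func usableMemory(g gpuInfo, overhead uint64) uint64 {
	if overhead >= g.FreeMemory {
		return 0
	}
	return g.FreeMemory - overhead
}

func main() {
	gpus := []gpuInfo{
		{ID: "GPU-aaaa", FreeMemory: 24 << 30}, // e.g. an RTX 3090
		{ID: "GPU-bbbb", FreeMemory: 24 << 30}, // e.g. a Tesla P40
	}
	// Mixed addressing: index for the first GPU, UUID for the second.
	overheads := map[string]uint64{"0": 6 << 30, "GPU-bbbb": 0}
	for i, g := range gpus {
		fmt.Println(g.ID, usableMemory(g, overheadFor(overheads, gpus, i)))
	}
}
```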

Checklist

  • Code follows project style
  • Unit tests pass
  • Backwards compatible (single value still works)
  • No breaking changes
  • Supports both index-based and UUID-based device IDs
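The backwards-compatibility item above (a single value still works) implies the parser must distinguish the legacy global form from the new per-device map. One plausible shim, not the PR's actual code: since device prefixes are separated by a colon and GPU UUIDs contain only hyphens, a value without any ':' can be treated as the legacy global overhead.

```go
package main

import (
	"fmt"
	"strings"
)

// isLegacyForm reports whether the OLLAMA_GPU_OVERHEAD value uses the
// old single-value syntax that applies to every GPU. Hypothetical
// sketch of the compatibility check.
func isLegacyForm(env string) bool {
	// Per-device entries always contain a ':' separator;
	// GPU UUIDs use '-', so no false positives.
	return !strings.Contains(env, ":")
}

func main() {
	fmt.Println(isLegacyForm("6442450944")) // true: global overhead
	fmt.Println(isLegacyForm("0:6GB,1:0"))  // false: per-GPU map
}
```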

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 00:52:13 -05:00

Reference: github-starred/ollama#14374