[PR #14120] [CLOSED] server: account for OLLAMA_NUM_PARALLEL in VRAM-based default context length #61221

Closed
opened 2026-04-29 16:17:49 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14120
Author: @4RH1T3CT0R7
Created: 2/6/2026
Status: Closed

Base: main ← Head: main


📝 Commits (1)

  • c6df318 Add tests and logic to adjust defaultNumCtx based on parallelism

📊 Changes

2 files changed (+71 additions, -0 deletions)


📝 server/routes.go (+6 -0)
📝 server/routes_options_test.go (+65 -0)

📄 Description

  • Fix the VRAM-based default context length to divide by OLLAMA_NUM_PARALLEL, preventing VRAM exhaustion when parallelism > 1
  • The KV cache is allocated for NumCtx * NumParallel tokens, but the tiered defaults only consider total VRAM, ignoring the parallelism multiplier
  • Add tests for the new behavior

Problem

Fixes #14116, #14088, #14073

The tiered VRAM-based default context lengths (4K / 32K / 256K) introduced in 0.15.5 don't account for OLLAMA_NUM_PARALLEL. Since the KV cache is allocated for
NumCtx * NumParallel tokens (llm/server.go:175), setting OLLAMA_NUM_PARALLEL > 1 causes VRAM exhaustion and model loading failures.

Example: a 24 GiB GPU with OLLAMA_NUM_PARALLEL=4 gets default ctx=32768, so the KV cache is sized for 32768 * 4 = 131072 tokens, far more VRAM than fits.
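
To make the arithmetic concrete, here is a minimal, self-contained Go sketch of the overshoot. The 32768 tier value for a 24 GiB GPU comes from the example above; the variable names are illustrative only, not the actual server code.

```go
package main

import "fmt"

func main() {
	defaultNumCtx := 32768 // VRAM-tier default described above for a 24 GiB GPU
	numParallel := 4       // OLLAMA_NUM_PARALLEL in the failing configuration

	// The KV cache is allocated for NumCtx * NumParallel tokens
	// (llm/server.go:175), so the GPU must hold four times the
	// context the tier was sized for.
	effectiveTokens := defaultNumCtx * numParallel
	fmt.Println("effective KV cache tokens:", effectiveTokens) // 131072
}
```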

Fix

Divide s.defaultNumCtx by numParallel at startup, immediately after tier selection. This only affects the VRAM-based default; explicit settings via
OLLAMA_CONTEXT_LENGTH, the model config, or an API request are unchanged.
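
Below is a minimal, self-contained Go sketch of that adjustment. The real change operates on s.defaultNumCtx inside server/routes.go; the function name and standalone structure here are illustrative assumptions, not the actual diff.

```go
package main

import "fmt"

// adjustDefaultCtx divides the VRAM-tier default context length by the
// number of parallel slots, since the KV cache is allocated for
// NumCtx * NumParallel tokens. Explicit settings (OLLAMA_CONTEXT_LENGTH,
// model config, API request) never reach this path.
func adjustDefaultCtx(tierDefault, numParallel int) int {
	if numParallel > 1 {
		return tierDefault / numParallel
	}
	return tierDefault
}

func main() {
	// 24 GiB tier default of 32768 with OLLAMA_NUM_PARALLEL=4:
	// each slot gets 8192 tokens, keeping the total KV cache at 32768.
	fmt.Println(adjustDefaultCtx(32768, 4)) // 8192
}
```

The per-slot context shrinks, but the total KV cache allocation stays within the VRAM budget the tier was originally chosen for.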


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:17:49 -05:00
