[GH-ISSUE #13494] Integer underflow in GPU memory calculation causes "unable to allocate CUDA buffer" error when loading small models #55410

Closed
opened 2026-04-29 09:08:11 -05:00 by GiteaMirror · 2 comments

Originally created by @takumi-ricoh on GitHub (Dec 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13494

Integer underflow in GPU memory calculation causes "unable to allocate CUDA buffer" error when loading small models

Summary

When loading a small 8B Q4_K_M model after two large models (120B and 20B) are already loaded, Ollama fails with an "unable to allocate CUDA buffer" error due to an integer underflow bug in the GPU memory calculation. The bug causes Ollama to incorrectly believe a nearly full GPU has ~17 exabytes of free memory.

Environment

  • Ollama version: v0.12.11 (also reproduced on v0.13.4)
  • OS: Linux
  • GPUs: 3x NVIDIA L40S (48GB each)
  • Docker deployment

Configuration

OLLAMA_KEEP_ALIVE=-1
OLLAMA_MAX_LOADED_MODELS=3
OLLAMA_SCHED_SPREAD=0
OLLAMA_DEBUG=1

Steps to Reproduce

  1. Load first large model (120B model, e.g., a quantized 120B model):

    ollama run gpt-oss:120b "Hello"
    # Model loads successfully (~80GB VRAM used)
    
  2. Load second model (20B model, e.g., a safeguard/moderation model):

    ollama run your-safeguard-model:20b "Hello"
    # Model loads successfully (~39GB VRAM used)
    # Total: 2 models loaded
    
  3. Attempt to load small 8B Q4_K_M model:

    ollama run your-8b-model:q4_k_m "Hello"
    # FAILS with error below
    

Expected Behavior

Since OLLAMA_MAX_LOADED_MODELS=3, Ollama should either:

  1. Auto-unload one of the existing models to make room, OR
  2. Load the small model on a GPU with sufficient free space (GPU 1 or GPU 2)

Actual Behavior

Error: 500 Internal Server Error: llama runner process has terminated:
error loading model: unable to allocate CUDA0 buffer

GPU state before error:

GPU 0: 357 MiB free, 45.7 GB used
GPU 1: 21 GB free, 25 GB used
GPU 2: 1.4 GB free, 44.7 GB used

Debug Logs Analysis

With OLLAMA_DEBUG=1, the logs reveal integer underflow:

time=2025-12-16T07:36:37.183Z level=INFO msg="updated VRAM" gpu=GPU-5320d871... available="357.9 MiB"
time=2025-12-16T07:36:37.183Z level=DEBUG source=server.go:921 msg="available gpu" id=GPU-5320d871... "available layer vram"="17179869183.3 GiB"
time=2025-12-16T07:36:37.183Z level=INFO msg=load request="GPULayers:33[ID:GPU-5320d871... Layers:33(0..32)]"

Key evidence:

  • GPU 0 has only 357.9 MiB free
  • But Ollama calculates 17,179,869,183.3 GiB (~17 exabytes) of "available layer vram"
  • All 33 layers are assigned to GPU 0 despite insufficient space
  • Actual CUDA allocation fails

Root Cause

Integer underflow occurs in llm/server.go around lines 914-919 in the buildLayout function:

reserved := uint64(float32(gl[i].FreeMemory)*backoff) + gl[i].MinimumMemory() + envconfig.GpuOverhead() + memory.GPUs[j].Graph
if gl[i].FreeMemory > reserved {
    gl[i].FreeMemory -= reserved
} else {
    gl[i].FreeMemory = 0
}

Issue: When this code is executed multiple times for the same GPU (when memory.GPUs contains multiple entries for the same device), the second iteration causes underflow:

  1. First iteration: FreeMemory = 357.9 MiB, reserved = 457 MiB + 17.5 GiB

    • Since 357.9 MiB < reserved, set FreeMemory = 0
  2. Second iteration (same GPU): FreeMemory = 0, tries to subtract reserved again

    • Although the if-else prevents a direct underflow here, the logs still show 17,179,869,183.3 GiB
    • This suggests the graph memory calculation results in an incorrect (wrapped) value being used

The value 17,179,869,183.3 GiB ≈ 2^64 bytes (17,179,869,184 GiB), indicating uint64 underflow when subtracting graph memory from a GPU with insufficient free space.
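
To make the failure mode concrete, here is a minimal standalone Go program (my own illustration, not Ollama code) showing how an unchecked uint64 subtraction of a ~17.5 GiB reservation from a few hundred MiB of free memory wraps around to just below 2^64 bytes, the same order of magnitude as the impossible figure in the debug log:

package main

import "fmt"

// Minimal illustration (not Ollama code): subtracting a reservation larger
// than the free memory from a uint64 wraps around to just below 2^64.
func main() {
	const (
		MiB = uint64(1) << 20
		GiB = uint64(1) << 30
	)

	free := 357 * MiB       // roughly what GPU 0 actually had free
	graph := 17*GiB + GiB/2 // the 17.5 GiB graph reservation from the logs

	wrapped := free - graph // unchecked uint64 subtraction underflows
	fmt.Printf("%.1f GiB\n", float64(wrapped)/float64(GiB))
	// Prints ~17179869166.8 GiB, matching the magnitude of the
	// "available layer vram" values reported in the debug log.
}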

Why Large Models Don't Trigger This Bug

When loading another large model (e.g., gpt-oss:120b → another 70B+ model):

  • The new model is obviously too large to fit
  • Auto-unload logic kicks in before memory calculation
  • Clean state prevents integer underflow condition

When loading a small 8B Q4_K_M model after large models:

  • OLLAMA_MAX_LOADED_MODELS still has capacity (2/3 used)
  • Small 8B model appears to fit (only ~4.4GB needed), so no auto-unload occurs
  • GPU memory calculation proceeds
  • Integer underflow bug is triggered
  • Ollama incorrectly assigns all layers to nearly-full GPU 0
  • Actual allocation fails

This is an ironic situation: smaller models are more likely to encounter this bug than larger models.

Suggested Fix

Ensure gl[i].FreeMemory calculations are performed only once per GPU, or add bounds checking to prevent underflow when FreeMemory is already 0 or insufficient.

Possible fix in llm/server.go around line 914-919:

reserved := uint64(float32(gl[i].FreeMemory)*backoff) + gl[i].MinimumMemory() + envconfig.GpuOverhead() + memory.GPUs[j].Graph
if gl[i].FreeMemory > reserved {
    gl[i].FreeMemory -= reserved
} else {
    gl[i].FreeMemory = 0
    // Prevent further processing on this GPU if it has no available memory
}

Or ensure that each GPU is only processed once in the loop by tracking which GPUs have already been processed.
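
For illustration only, a minimal self-contained sketch of that second approach combined with the existing clamp-at-zero check. The gpuInfo type, its field names, and the applyReservation function are hypothetical; they do not match Ollama's actual structures in llm/server.go:

package main

import "fmt"

type gpuInfo struct {
	ID         string
	FreeMemory uint64
}

// applyReservation charges the reserved bytes against each physical device
// at most once, clamping at zero so the uint64 counter can never wrap.
func applyReservation(gpus []*gpuInfo, reserved map[string]uint64) {
	seen := make(map[string]bool)
	for _, g := range gpus {
		if seen[g.ID] {
			continue // this device was already charged its reservation
		}
		seen[g.ID] = true
		if r := reserved[g.ID]; g.FreeMemory > r {
			g.FreeMemory -= r
		} else {
			g.FreeMemory = 0
		}
	}
}

func main() {
	const MiB, GiB = uint64(1) << 20, uint64(1) << 30
	// The same physical device appearing twice, as hypothesized above.
	g := &gpuInfo{ID: "GPU-0", FreeMemory: 357 * MiB}
	applyReservation([]*gpuInfo{g, g}, map[string]uint64{"GPU-0": 18 * GiB})
	fmt.Println(g.FreeMemory) // 0 — clamped once, never charged a second time
}

An actual fix would need to live inside buildLayout itself; this only demonstrates the idea of deduplicating per-device reservations.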

Workaround

Option 1: Use SCHED_SPREAD

Set OLLAMA_SCHED_SPREAD=1 to spread models across all GPUs, which reduces the likelihood of hitting this bug (though it does not eliminate it):

OLLAMA_SCHED_SPREAD=1

Option 2: Manual unload

Manually unload models before loading new ones:

ollama stop gpt-oss:120b
ollama run your-8b-model:q4_k_m "Hello"

Additional Notes

This bug affects scenarios where:

  • Multiple large models are loaded sequentially
  • OLLAMA_MAX_LOADED_MODELS allows one more model
  • The next model is small enough that auto-unload doesn't trigger
  • One GPU is nearly full but has a small amount of free memory

The bug is triggered more frequently with OLLAMA_SCHED_SPREAD=0 (pack strategy) than with OLLAMA_SCHED_SPREAD=1 (spread strategy), but can occur in both cases.

GiteaMirror added the bug label 2026-04-29 09:08:11 -05:00

@jessegross commented on GitHub (Dec 17, 2025):

Please include the log from when this happens on 0.13.4.


@takumi-ricoh commented on GitHub (Dec 19, 2025):

Attached log: ollama_bug_logs_v0.13.4.txt (https://github.com/user-attachments/files/24256096/ollama_bug_logs_v0.13.4.txt)

Response to GitHub Issue #13494

Reproduced on v0.13.4

I've reproduced this issue on Ollama v0.13.4. Here are the logs showing the integer underflow bug:

Environment

  • Ollama version: v0.13.4 (Docker)
  • GPUs: 3x NVIDIA L40S (48GB each)
  • Other GPU consumers: LocalAI, vLLM, Whisper running on the same system (consuming some GPU 0 memory)
  • Configuration:
    OLLAMA_DEBUG=1
    OLLAMA_KEEP_ALIVE=-1
    OLLAMA_MAX_LOADED_MODELS=3
    OLLAMA_SCHED_SPREAD=0
    

Steps to Reproduce

  1. Have other GPU-consuming services running (LocalAI, vLLM, etc.) that use some GPU memory
  2. Load a large model (e.g., 120B model using ~80GB VRAM across GPUs)
  3. Load a second model (e.g., 8B model with 64K context using ~39GB VRAM)
  4. Attempt to load a third small model (8B Q4_K_M)

Key Log Evidence

Integer underflow detected - GPU reports ~17 exabytes of "available" VRAM:

time=2025-12-19T10:36:20.958Z level=DEBUG source=server.go:965 msg="available gpu" id=GPU-5320d871... library=CUDA "available layer vram"="17179869183.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2025-12-19T10:36:20.958Z level=DEBUG source=server.go:965 msg="available gpu" id=GPU-5320d871... library=CUDA "available layer vram"="17179869165.8 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="17.5 GiB"

The value 17179869183.3 GiB ≈ 2^64 bytes (17,179,869,184 GiB), indicating uint64 underflow; the second logged value, 17179869165.8 GiB, is that figure minus the 17.5 GiB graph.
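
As a quick arithmetic sanity check (my own verification, not from the attached logs): 2^64 bytes expressed in GiB is 2^34 = 17,179,869,184, and the two logged values differ by exactly the 17.5 GiB graph size, consistent with the graph reservation being applied on top of an already-wrapped counter:

package main

import "fmt"

func main() {
	// 2^64 bytes expressed in GiB: 2^64 / 2^30 = 2^34.
	fmt.Println(uint64(1) << 34) // 17179869184

	// The two logged "available layer vram" values differ by the 17.5 GiB
	// graph reservation reported on the same lines.
	fmt.Printf("%.1f\n", 17179869183.3-17.5) // 17179869165.8
}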

Resulting error:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4403.49 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 4617396480
llama_model_load: error loading model: unable to allocate CUDA0 buffer
time=2025-12-19T10:36:22.622Z level=INFO source=sched.go:470 msg="Load failed" model=... error="llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer"

GPU State Before Error

GPU 0: 378 MiB free (nearly full - partially used by LocalAI/vLLM/Whisper)
GPU 1: 1.5 GB free
GPU 2: 21 GB free

Despite GPU 2 having 21GB free, the scheduler incorrectly assigned all layers to GPU 0 (which only had 378 MiB free) due to the underflow bug.

Root Cause Hypothesis

The integer underflow occurs in the GPU memory calculation when:

  1. Multiple models are already loaded, and/or other GPU consumers (LocalAI, vLLM, etc.) are using GPU memory
  2. FreeMemory is less than reserved (MinimumMemory + Overhead + Graph)
  3. The subtraction results in uint64 underflow, producing ~17 exabytes

This bug is more likely to trigger in environments where Ollama shares GPUs with other services.

Workaround

Setting OLLAMA_SCHED_SPREAD=1 reduces the likelihood of triggering this bug by spreading models across GPUs more evenly.
