Multi-GPU (AMD) Performance Regression in Ollama with ROCm 6.3.1 #5646

Open
opened 2025-11-12 13:05:29 -06:00 by GiteaMirror · 10 comments

Originally created by @konian71 on GitHub (Jan 30, 2025).

What is the issue?

Description

For the past two weeks, my second and third GPUs have been freezing during inference when running Ollama. After the crash, the VRAM remains full, but the GPUs stop processing. However, the system itself remains stable, and a soft reboot is no longer possible—only a hard reboot restores functionality.

Additionally, GPU performance has drastically decreased. Previously, Qwen2.5-Coder 32B ran at ~17 tokens/sec, but now it barely reaches 4 tokens/sec. Even models that fully fit into VRAM are underperforming.

System Information

CPU: AMD Ryzen Threadripper 3960X
RAM: 256 GB DDR4-3200
GPUs: 3 x AMD Radeon RX 7900 XTX
OS: Ubuntu 24.04 LTS Server
Ollama Version: (latest, post-update)
ROCm Version: 6.3.1 (with ROCm-SMI 5.7.0)

Key Findings & Symptoms

GPU Workload Distribution is Broken
GPUs are recognized but remain idle during inference.
Power consumption stays under 100W per GPU, which is far too low.
GPU clock speeds (sclk) remain at 0 MHz, meaning no actual computation occurs.
CPU utilization is very high (~80-95%), even when models should be running entirely in VRAM.
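
(For reference, a simple way to capture these symptoms while a prompt is in flight; a minimal sketch, assuming the default systemd install and a working rocm-smi:)

```
# Watch per-GPU clocks, power and load once per second during inference.
# The plain `rocm-smi` table already includes SCLK/MCLK, AvgPwr and GPU%.
watch -n 1 rocm-smi

# More targeted output, if these flags exist in your rocm-smi build
# (check `rocm-smi --help` first):
# rocm-smi --showclocks --showpower --showuse
```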

Specific Models Trigger Freezes
Mistral-Large frequently crashes the GPUs, requiring a hard reboot.
Llama3.3-70B is extremely slow but at least remains stable.
DeepSeek R1 32B only uses ~50% of VRAM, yet the GPUs remain idle.

Multi-GPU Scaling is Failing
Performance does not improve with multiple GPUs.
Even when a model fully fits into VRAM, Ollama does not utilize GPU compute units.
Restricting Ollama to a single GPU (CUDA_VISIBLE_DEVICES=0) sometimes improves stability (see the note below on the AMD equivalents of this variable).
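
A note on the variable above: ROCm builds of Ollama typically control GPU visibility through HIP_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES rather than CUDA_VISIBLE_DEVICES. A minimal sketch for limiting the systemd-managed service to the first GPU (assuming the default ollama.service unit from the install script):

```
# Add a systemd drop-in so the ollama service only sees GPU 0.
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="ROCR_VISIBLE_DEVICES=0"
sudo systemctl restart ollama
```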

Possible ROCm Regression
Previously (ROCm 5.7.1 or earlier), everything worked fine.
After Ollama updated, ROCm was also upgraded to 6.3.1 automatically.
It is unclear whether this issue is caused by Ollama’s inference engine or a ROCm 6.3.1 regression.
Downgrading ROCm is not trivial, as Ollama depends on its installed version.

Troubleshooting Attempts
Setting the performance level to compute mode (rocm-smi --setperflevel compute) → No effect (see the syntax note after this list)
Manually setting GPU clocks (rocm-smi --setclk OD) → No effect
Checking GPU activity with rocm-smi -a → Compute Units remain inactive
Running Ollama with one GPU only (CUDA_VISIBLE_DEVICES=0) → Minor improvement, but still slow
Testing alternative models (DeepSeek R1 32B, Llama 3.3-70B, Qwen2.5-Coder 32B) → All underperform significantly
Checking alternative inference engines (vLLM, llama.cpp) → Pending tests
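
(Note: the perf-level and clock flags quoted above may not match rocm-smi's actual syntax; a hedged sketch of the more common invocation, worth verifying against `rocm-smi --help` on your release:)

```
# Pin the DPM/performance level high on all GPUs (revert with --setperflevel auto).
sudo rocm-smi --setperflevel high
```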

Next Steps & Questions
Is this a known issue with ROCm 6.3.1?
Has Ollama introduced a regression in its ROCm backend?
Would a downgrade to ROCm 5.7.1 restore previous performance?
Would switching to vLLM resolve the Multi-GPU scaling issues?
Are there workarounds to force GPU utilization properly under Ollama?
Any insights or suggestions would be greatly appreciated!

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.5.5 up to latest

GiteaMirror added the bug, gpu, amd labels 2025-11-12 13:05:29 -06:00

@rick-github commented on GitHub (Jan 30, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
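
On a default Linux install (systemd service), the server log referenced above can be captured roughly like this (a sketch; OLLAMA_DEBUG just adds verbosity):

```
# Dump the ollama service log for attachment to the issue.
journalctl -u ollama --no-pager > ollama.log

# Optional: enable debug logging, then reproduce and capture again.
#   sudo systemctl edit ollama   ->  Environment="OLLAMA_DEBUG=1"
#   sudo systemctl restart ollama
```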


@konian71 commented on GitHub (Jan 30, 2025):

Here it is:
[ollama.log](https://github.com/user-attachments/files/18609359/ollama.log)

```
root@ki: ollama ps
NAME                               ID              SIZE     PROCESSOR    UNTIL
qwen2.5-coder:32b-instruct-q8_0    f37bbf27ec01    54 GB    100% GPU     8 seconds from now

root@ki: rocm-smi
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    45.0c           80.0W   527Mhz  1249Mhz  0%   auto  283.0W   68%   28%
1    41.0c           68.0W   27Mhz   1249Mhz  0%   auto  283.0W   70%   0%
2    41.0c           68.0W   26Mhz   1249Mhz  0%   auto  283.0W   70%   0%
====================================================================================
=============================== End of ROCm SMI Log ================================
```

Speed:
response_token/s = 3.84

Funny detail:
The fans on all GPUs stopped during inference.


@konian71 commented on GitHub (Feb 2, 2025):

I removed the second and third GPU from my system and it now runs stable. With only one GPU, qwen2.5-coder:32b-instruct-q8_0 produces about 4 tokens per second, even though 50% of the processing is done by the CPU. After reinstalling, I did not install ROCm; instead, Ollama uses its own bundled AMD library. With three GPUs I previously got 17 tokens per second on qwen2.5-coder:32b-instruct-q8_0, though not without errors in the log file. If I reinstall the AMD drivers, performance will probably drop back to 4 tokens per second, despite 100% GPU utilization. Right now, with one GPU, the split is about 50% GPU / 50% CPU, which gives 4 tokens/second.

I tested mistral-large with one GPU and it ran slowly but stably. With two GPUs it already fails: inference terminates unexpectedly, and one GPU (the second) gets stuck on shutdown with the following errors: `[drm] evicting device resources failed` and `amdgpu 000:23:00:0: amdgpu: Failed to disallow df cstate`.
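
(When a GPU wedges like this, the kernel log usually carries more context around the eviction failure; a quick, hedged check:)

```
# Look for amdgpu/drm errors (ring timeouts, reset attempts, eviction failures)
# around the time of the hang.
sudo dmesg -T | grep -iE 'amdgpu|drm' | tail -n 50
# or via journald:
journalctl -k -b | grep -iE 'amdgpu|drm'
```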


@konian71 commented on GitHub (Feb 2, 2025):

Hi all,

Just wanted to give a quick update: I reconnected all three AMD 7900XTX GPUs on my Ubuntu 24.04 minimal setup (just ran the Ollama install script, no additional ROCm) and inference has been running for four hours without crashes. I also noticed that the output is now consistently ~10-17 tokens/sec for an LLM with ~70 GB in VRAM and 0% CPU.

The following LLMs are now running stable:

  • qwen2.5-coder:32b-instruct.q8_0
  • llama3.3:70b-instruct_q8_0
  • deepseek-R1:70b-llama-distill_q8_0
  • deepseek-R1:32b-qwen-distill_q8_0
  • phi4:14b-q8_0

mistral-large:123b-instruct_q8_0 was 100% stable with one GPU, but not with two or three, so it's probably an issue with the AMD multi-GPU driver. Since my desired LLMs are now running fine, I'll leave the setup as is.

The problem is not completely solved, but I wanted to share these findings with you. Maybe it will help someone.
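
For anyone who wants to reproduce the "minimal setup" described above (Ollama with only its bundled ROCm libraries, no system-wide ROCm), a rough sketch; the install script URL is the standard one, and whether the system ROCm packages need to be removed first is an assumption that depends on your box:

```
# 1. (Optional, assumption) remove or hold back the system-wide ROCm packages
#    so they cannot be picked up instead of Ollama's bundled libraries.
# 2. Install (or reinstall) Ollama with the official script:
curl -fsSL https://ollama.com/install.sh | sh
# 3. Restart the service and re-test:
sudo systemctl restart ollama
```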


@melroy89 commented on GitHub (Feb 3, 2025):

A bit off-topic, but shouldn't Ollama update to [ROCm 6.3.2](https://rocm.docs.amd.com/en/docs-6.3.2/about/release-notes.html) directly?


@QyInvoLing commented on GitHub (Feb 12, 2025):

> Hi all,
>
> Just wanted to give a quick update: I reconnected all three AMD 7900XTX GPUs on my Ubuntu 24.04 minimal setup (just ran the Ollama install script, no additional ROCm) and inference has been running for four hours without crashes. I also noticed that the output is now consistently ~10-17 tokens/sec for an LLM with ~70 GB in VRAM and 0% CPU.
>
> The following LLMs are now running stable:
>
>   • qwen2.5-coder:32b-instruct.q8_0
>   • llama3.3:70b-instruct_q8_0
>   • deepseek-R1:70b-llama-distill_q8_0
>   • deepseek-R1:32b-qwen-distill_q8_0
>   • phi4:14b-q8_0
>
> mistral-large:123b-instruct_q8_0 was 100% stable with one GPU, but not with two or three, so it's probably an issue with the AMD multi-GPU driver. Since my desired LLMs are now running fine, I'll leave the setup as is.
>
> The problem is not completely solved, but I wanted to share these findings with you. Maybe it will help someone.

How many tokens can you get when running deepseek-R1:70b-llama-distill_q8_0 using three 7900xtx?


@konian71 commented on GitHub (Feb 13, 2025):

> How many tokens can you get when running deepseek-R1:70b-llama-distill_q8_0 using three 7900xtx?

70 billion parameters are too much for 72GB of VRAM, which makes it slow, with CPU/GPU utilization at 7%/93%. I estimate around 4 tokens per second. I can't test it properly at the moment because my setup currently has only two GPUs.
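
(A back-of-envelope check on that claim; q8_0 is roughly 8.5 bits per weight, and this ignores the KV cache and runtime overhead, so treat it as a rough lower bound:)

```
# ~70e9 parameters at ~8.5 bits each, expressed in GB of weights alone:
echo '70 * 10^9 * 8.5 / 8 / 10^9' | bc -l    # ≈ 74.4 GB, already above 72 GB of VRAM
```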


@JoshuaBowerman commented on GitHub (Feb 19, 2025):

I'm also seeing similar issues after updating, with a 7900 XTX: massively reduced performance, including with PyTorch. I'm guessing this is a problem with a newer version of ROCm or maybe an amdgpu driver change. I'm not seeing any reduction in graphics performance, so I think the hardware is fine. I'm also seeing high CPU utilization, even for extremely small models that fit 100% in VRAM.


@JoshuaBowerman commented on GitHub (Feb 20, 2025):

It seems that for me, Ollama was claiming to use the GPU when it was actually using the CPU. `ollama ps` showed 100% GPU even though the model was clearly loaded into RAM and running on the CPU; `rocm-smi` showed no VRAM or GPU usage during inference, while Ollama was consuming RAM and CPU.

I'm on Arch; uninstalling the `ollama` package and installing `ollama-rocm` instead fixed the issue for me. I'm not sure why Ollama claimed to be using the GPU when it definitely was not.

You're on Ubuntu, so my issue is likely unrelated, but I figured I'd add this comment anyway in case someone comes across this thread with the same problem.


@rick-github commented on GitHub (Feb 20, 2025):

The %GPU displayed by `ollama ps` is determined from the GPUs detected before the runner is started. When Ollama starts a runner and finds the GPU isn't usable, it falls back to the CPU runner but doesn't update the %GPU value.

Reference: github-starred/ollama-ollama#5646