[GH-ISSUE #5949] Out of Memory Error when using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf model with Ollama ROCm with num_ctx=120000 #50226

GiteaMirror · 2026-04-28T14:46:46-05:00

GiteaMirror commented

2026-04-28 14:46:46 -05:00

Originally created by @renbuarl on GitHub (Jul 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5949

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

OS: Linux 6.5.0-44-generic #44~22.04.1-Ubuntu

GPU:

3x AMD Radeon RX 7900 XTX (24 GiB VRAM each)

Ollama version: 0.2.8

ROCm module version: 6.7.0
amdgpu-install_6.1.60103-1_all.deb

Model: Meta-Llama-3.1-8B-Instruct-Q8_0

While testing the Meta-Llama-3.1-8B-Instruct-Q8_0.gguf model, I encountered an out of memory error well before reaching the maximum context size of 128k for the model. The model crashes after processing approximately 28,000 tokens, regardless of whether using one GPU with 24GB of memory (nctx = 30,000) or three GPUs with a combined memory of 72GB (nctx = 120,000).

Error:
Jul 25 12:39:17 ailab ollama[683]: CUDA error: out of memory
Jul 25 12:39:17 ailab ollama[683]: current device: 0, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:291
Jul 25 12:39:17 ailab ollama[683]: ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
Jul 25 12:39:17 ailab ollama[683]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: !"CUDA error"

There might be similar issues, but out of memory errors with multiple GPUs have not been reported yet.

OS

Linux

GPU

AMD

CPU

Intel

Ollama version

0.2.8

GiteaMirror added the memory bug amd labels 2026-04-28 14:46:47 -05:00

GiteaMirror commented

2026-04-28 14:46:48 -05:00

@rick-github commented on GitHub (Jul 25, 2024):

Server logs would help with diagnosis. Sounds similar to https://github.com/ollama/ollama/issues/5913, there's a workaround in the comments.

GiteaMirror commented

2026-04-28 14:46:48 -05:00

@renbuarl commented on GitHub (Jul 25, 2024):

Very similar to #5913, but for the case of multiple GPUs. In #5913 reducing num_gpu is indeed a workaround, as VRAM is genuinely low. In this case, reducing num_gpu simply offloads to CPU when there are available GPUs and sufficient VRAM. This is an obvious bug.

GiteaMirror commented

2026-04-28 14:46:49 -05:00

@rick-github commented on GitHub (Jul 25, 2024):

Server logs would help with diagnosis.

GiteaMirror commented

2026-04-28 14:46:49 -05:00

@renbuarl commented on GitHub (Jul 25, 2024):

journal.txt: https://github.com/user-attachments/files/16377575/journal.txt

GiteaMirror commented

2026-04-28 14:46:49 -05:00

@dhiltgen commented on GitHub (Jul 26, 2024):

The bug here is likely we're not properly adjusting the prediction for the large context size.

GiteaMirror commented

2026-04-28 14:46:50 -05:00

@rick-github commented on GitHub (Jul 26, 2024):

I did a little experiment: I loaded the same model multiple times with different versions of ollama. ollama always made the same calculations, but as versions went from 0.1.40 to 0.3.0, the VRAM usage of the llama server went from 5156MiB to 5214MiB. Not a lot, but when llama.cpp is using 23.9 of 24G (https://github.com/ollama/ollama/issues/5913), it may be enough to push things over the edge.

GiteaMirror commented

2026-04-28 14:46:50 -05:00

@Speedway1 commented on GitHub (Jul 28, 2024):

Just a heads-up that the problem of generating garbage with >1 AMD GPUs is still an issue. Something broke a few versions of Ollama ago, because it used to work. It's also specific to Ollama, because llama.cpp works well for the same models where VRAM usage is >24GB: the load is shared across GPUs without any issues. But with Ollama, as soon as more than 1 GPU is needed, garbage is produced.

We're still trying to get some helpful information to feed back to the team here to get this fixed, but I would set expectations that even if the OOM in this report is fixed, it's still not going to run across multiple cards successfully. Best to limit to 1 GPU and CPU RAM, which seems to work.

ATM we've downgraded our multi-GPU AMD boxes to multiple Ollamas running on single GPUs, separated by port number. E.g. a 2 GPU box will have 2 instances of Ollama running, with two different port numbers. Each Ollama instance is restricted to 1 GPU only and of course can use CPU if needed. Run multi-card jobs on NVIDIA, which is better supported both at the OS level and within Ollama.
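The one-instance-per-GPU layout described above could be sketched roughly as follows. This is only an illustration, not the actual configuration used here: it assumes the ROCm variable `HIP_VISIBLE_DEVICES` pins each server to one AMD GPU (some setups use `ROCR_VISIBLE_DEVICES` instead) and that `OLLAMA_HOST` sets the listen address. The base port 11434 and the helper name `ollama_gpu_cmd` are hypothetical. The helper only prints each command so it can be checked before anything is started.

```shell
#!/usr/bin/env bash
# Sketch: one Ollama server per AMD GPU, separated by port number.
# Assumptions (not from the thread): HIP_VISIBLE_DEVICES pins the server to a
# single GPU, OLLAMA_HOST sets the listen address, and ports start at 11434.

BASE_PORT=11434

# Print the launch command for the GPU with the given index (hypothetical helper).
ollama_gpu_cmd() {
  local gpu="$1"
  local port=$((BASE_PORT + gpu))
  echo "HIP_VISIBLE_DEVICES=${gpu} OLLAMA_HOST=127.0.0.1:${port} ollama serve"
}

# Example: show the commands for a 2-GPU box.
for gpu in 0 1; do
  ollama_gpu_cmd "$gpu"
done
```

Each printed line can then be run in its own terminal or service unit, and clients are routed to a GPU by choosing the matching port.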

Where we absolutely must use multi-card AMD GPUs, we're using llama.cpp and its OpenAI API compatible server. It runs across all GPUs with no problem, provided it's compiled with the LLAMA_CUDA_NO_PEER_COPY=1 flag. But this means that Ollama's wonderful LLM swapping support is missing, so it ties a machine down to serving that one LLM only, e.g. Llama3 70B. We adjust LLM routing accordingly in this instance.

Hope these comments help. It won't be long before the problems with multi-AMD cards are fixed. It's just a matter of getting the correct diagnostics, which are elusive at the moment.

GiteaMirror commented

2026-04-28 14:46:50 -05:00

@renbuarl commented on GitHub (Jul 30, 2024):

Speedway1, thank you for your message!
However, it seems that the issue is not with ollama, but with llama.cpp.
I built the latest release of llama.cpp #b3488 following the methodology described in
https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu (Thanks to the author!)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean && make -j4 GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1100

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32000 --host '192.168.0.5' --port 8081 -ngl 99

When the real context exceeds 10k tokens, the same error occurs as in ollama:

CUDA error: out of memory
current device: 2, in function alloc at ggml/src/ggml-cuda.cu:291
ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
ggml/src/ggml-cuda.cu:101: CUDA error

GiteaMirror commented

2026-04-28 14:46:51 -05:00

@renbuarl commented on GitHub (Jul 30, 2024):

https://github.com/ggerganov/llama.cpp/issues/8766

GiteaMirror commented

2026-04-28 14:46:51 -05:00

@Speedway1 commented on GitHub (Jul 30, 2024):

Hi @renbuarl , I think that the problem there is your massive context length. It takes a lot of VRAM. Here is a simple bit of bash that we run when loading up LLMs on AMD to monitor the consumption, it's handy to have open in a window!

while true; do rocm-smi; sleep 1; done

At the moment we have Mistral Large, quantised to Q2_K, running on 2x Radeon 7900 XTX (2x24GB) with the following:

llama.cpp/server -m /home/tmp/Mistral-Large-Instruct-2407_q2_k.gguf -ngl 89 -n 1500 -c 1500 --host 0.0.0.0 --port 2600 -a mistral
For those worried about the security: This is behind a firewall on our Dev VPN hence the open listen.

This is the SMI output:


======================================================== Concise Info ========================================================
Device  Node  IDs              Temp    Power   Partitions          SCLK     MCLK     Fan     Perf  PwrCap       VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Avg)   (Mem, Compute, ID)                                                            
==============================================================================================================================
0       1     0x744c,   55924  68.0°C  175.0W  N/A, N/A, 0         1667Mhz  1249Mhz  40.0%   auto  327.0W       99%    49%   
1       2     0x744c,   27211  70.0°C  178.0W  N/A, N/A, 0         1725Mhz  1249Mhz  41.96%  auto  327.0W       98%    49%   
2       3     0x164e,   33198  39.0°C  41.04W  N/A, N/A, 0         None     1800Mhz  0%      auto  Unsupported  17%    0%    
==============================================================================================================================

(Currently the machine is busy as you can see).

This is on llama.cpp. We cannot get Ollama to work across the cards at the moment.

Not sure if any of this is useful to you, but hoping that maybe some of it is.

GiteaMirror commented

2026-04-28 14:46:51 -05:00

@renbuarl commented on GitHub (Jul 31, 2024):

> Hi @renbuarl , I think that the problem there is your massive context length.

Great advice to use the '--flash-attn' option.

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 65536 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn

Maximum vram consumption is 68.88 GB with a real context of 32k, and there is no 'CUDA error: out of memory'.

GiteaMirror commented

2026-04-28 14:46:51 -05:00

@renbuarl commented on GitHub (Jul 31, 2024):

> The bug here is likely we're not properly adjusting the prediction for the large context size.

What do we have?

When launching without the --flash-attn option for llama-server

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99
The average VRAM consumption is 68.40 GB but we crash with 'CUDA error: out of memory' with relatively small actual context.

When launching with the --flash-attn option for llama-server it works perfectly
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn
The average VRAM consumption is 58.56 GB.

GiteaMirror commented

2026-04-28 14:46:52 -05:00

@ott2 commented on GitHub (Aug 1, 2024):

Specifying --no-kv-offload bypasses this error for me, even with the default 128K context. Otherwise using context -c 70689 or larger results in an out of memory error.

Background:

Totally different setup here (M1 Mac, 32GB RAM), but I'm also seeing repeatable memory outs with this model. This happens when the context size goes beyond a threshold value.

In my case

./llama-cli -m models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 70688 -p sing

works fine, but changing that to -c 70689 results in

ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)

With a word-level diff, the only significant difference I can see in the logs is the pair of lines:

llama_kv_cache_init:      Metal KV buffer size =  8836.00 MiB
llama_new_context_with_model:      Metal compute buffer size =  4588.07 MiB

versus

llama_kv_cache_init:      Metal KV buffer size =    8840.00 MiB
llama_new_context_with_model:      Metal compute buffer size =  4590.13 MiB

The context window increases the Metal KV buffer size until it hits the mysterious maximum value of 8836MiB as the limit in my case (a bit more than 8GB). Specifying --no-kv-offload seems to effectively switch off using the GPU, but at least allows inference to go ahead.
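The two KV buffer sizes in the logs above line up with simple KV cache arithmetic. The following is a rough sketch, not llama.cpp's actual allocation code: it uses the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) with an f16 cache, and it assumes the context is rounded up to a multiple of 32, which is inferred from the 8840.00 MiB figure rather than confirmed in this thread. The helper name `kv_cache_mib` is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: estimate the f16 KV cache size for Llama 3.1 8B at a given context.
# Model shape: 32 layers, 8 KV heads, head dimension 128, 2 bytes per f16 value.
# Assumption: the context is rounded up to a multiple of 32 before allocation.

N_LAYER=32
N_KV_HEAD=8
HEAD_DIM=128
BYTES_PER_VALUE=2   # f16

# Print the estimated KV cache size in MiB for a requested context length.
kv_cache_mib() {
  local n_ctx="$1"
  local padded=$(( (n_ctx + 31) / 32 * 32 ))
  # Keys and values (factor 2) for every layer, KV head and head dimension.
  local bytes_per_token=$(( 2 * N_LAYER * N_KV_HEAD * HEAD_DIM * BYTES_PER_VALUE ))
  echo $(( padded * bytes_per_token / 1024 / 1024 ))
}

for ctx in 70688 70689 131072; do
  echo "n_ctx=${ctx}: $(kv_cache_mib "$ctx") MiB"
done
```

Under these assumptions the estimate gives 8836 MiB for `-c 70688` and 8840 MiB for `-c 70689`, matching the two log lines, and 16384 MiB for the full 131072-token context. It covers only the KV cache; the compute buffer and the model weights come on top.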

GiteaMirror commented

2026-04-28 14:46:52 -05:00

@DevElCuy commented on GitHub (Oct 3, 2024):

> The bug here is likely we're not properly adjusting the prediction for the large context size.

Just learned today that memory allocation is related to the context size param. Any way to make it dynamic? We are talking about MAX context size so no need to allocate a lot of VRAM that we are hardly ever using

GiteaMirror commented

2026-04-28 14:46:53 -05:00

@dhiltgen commented on GitHub (Oct 17, 2024):

@develCuy we're tracking dynamic context size management via #1005

Reference: github-starred/ollama#50226