[GH-ISSUE #11317] Ollama/GGML: System Freezes, Crashes, and BSODs Due to ggml_host_malloc (Pinned Memory Allocation) Failures on GPUs (AMD/NVIDIA) #53980

Closed
opened 2026-04-29 05:02:38 -05:00 by GiteaMirror · 3 comments

Originally created by @d3f4ul7U53R on GitHub (Jul 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11317

What is the issue?

Problem Description:
The Ollama application frequently causes system-wide freezes, application crashes, and, in severe cases, Blue Screens of Death (BSODs) on both Windows and Linux when utilizing GPU acceleration (both AMD/ROCm and NVIDIA/CUDA). These occurrences consistently point to low-level memory allocation failures, specifically related to ggml_host_malloc (also known as "pinned memory").

Observed Behavior:

Symptoms: System-wide freezes, application crashes, and BSODs (on Windows).

Trigger: Problems are observed during model loading and, primarily, during active model inference/interaction (chatting). One user also reported a crash when attempting to stop the Ollama server (ollama stop), suggesting memory management or deallocation issues during shutdown.

Affected Models: The behavior has been observed with various models, including gemma:2b, llama3, Mistral-Nemo-Instruct-2407.Q5_K_M.gguf, and DeepSeek R1 Distill Qwen 7B.

Interactions: The issue tends to manifest after a variable number of interactions (e.g., "after two questions" for Llama3.2), often starting around the 5th interaction, depending on the model and system.

Technical Analysis and Suspected Cause:
Based on error logs (which consistently display messages such as ggml_cuda_host_malloc: failed to allocate X MiB of pinned memory: out of memory or invalid argument), the problem appears to be directly related to the allocation and management of host pinned memory by the GGML backend. This memory is crucial for efficient, high-speed data transfers between the CPU and GPU.

When the ggml_host_malloc function (or its specific variants like ggml_cuda_host_malloc or ggml_hip_host_malloc) fails—whether due to system limits, RAM fragmentation, or potential driver/implementation issues—it leads to severe system instability. As these are low-level memory allocation failures, they result in freezes and BSODs.
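
For illustration, the allocation pattern at issue looks roughly like the sketch below. This is a simplified rendering based on the log messages quoted later in this report, not the exact GGML source; host_malloc is a stand-in name.

#include <stddef.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Simplified sketch of a pinned (page-locked) host allocation. On
 * failure it returns NULL and the caller is expected to fall back to
 * ordinary pageable memory, which matches the "ROCm_Host" / "CPU"
 * buffer lines that follow the warnings in the logs below. */
static void *host_malloc(size_t size) {
    void *ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, size); /* page-locked allocation */
    if (err != cudaSuccess) {
        fprintf(stderr,
                "host_malloc: failed to allocate %zu bytes of pinned memory: %s\n",
                size, cudaGetErrorString(err));
        return NULL; /* caller falls back to pageable memory */
    }
    return ptr;
}

Pinned pages cannot be swapped out by the OS, so the amount a process can lock is bounded by available physical RAM and OS limits; this is why pinned allocations can fail even when plenty of virtual memory remains.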

The "large spike in GPU memory usage upon closing" observed by one user further suggests the possibility of memory leaks or inefficient deallocation of pinned memory, leading to its accumulation over time and subsequent failures, even after active use.

User 4:

OS: Windows

CPU: Intel Core i7 (quad-core)

RAM: 16GB (5GB used by OS)

GPU: NVIDIA GeForce GTX 1050 (4GB VRAM)

Ollama Version: Not specified in the report.

Model(s) Used: Gemma3:1b, Llama3.2

Symptoms: Complete computer freeze (with Gemma), BSODs after 2 questions (with Llama3.2), BSOD when trying to stop the server (ollama stop llama3.2), large spike in GPU memory usage upon closing.

External Discussion: "Ollama for Windows freezes or crashes Windows" (https://www.reddit.com/r/ollama/comments/1jfz86g/ollama_for_windows_freezes_or_crashes_windows/)

Proposed Mitigation / Potential Solution:
We believe a software solution can be implemented. We suggest adding a configuration option (e.g., a new environment variable or a llama.cpp flag) that allows users to disable the use of host pinned memory (i.e., ggml_cuda_host_malloc and ggml_hip_host_malloc) for GPU operations.

Explanation of Trade-off:
While this might result in slightly slower CPU-GPU transfers (as pageable memory would be used instead of pinned memory), we believe this approach would significantly improve system stability and make Ollama usable for many users currently experiencing critical crashes.

Potential Intervention Point (GGML/Ollama Developers):
The bool host_buffer field within the ggml_backend_dev_caps structure provides the most promising avenue for a software solution.

Option 1 (Ideal, llama.cpp/GGML upstream): Add a configuration option (via an environment variable or a backend initialization parameter) to force host_buffer to false for GPU backends. This would cause host memory allocations for CPU-GPU transfers to fall back to regular pageable memory, which, although slower, could circumvent the pinned memory exhaustion/freeze issues.

This would require a change within the C/C++ implementation of the GGML CUDA/ROCm backends to respect this flag and allocate "common" host memory instead of pinned memory.
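
As a minimal sketch of what Option 1 could look like, assuming the ggml_backend_dev_caps / ggml_backend_dev_props structures from ggml-backend.h (trimmed here so the snippet stands alone) and a hypothetical GGML_DISABLE_PINNED environment variable:

#include <stdbool.h>
#include <stdlib.h>

/* Stand-in declarations mirroring ggml-backend.h, simplified so this
 * sketch is self-contained; the real structs carry more fields. */
struct ggml_backend_dev_caps {
    bool async;
    bool host_buffer;
};

struct ggml_backend_dev_props {
    const char *name;
    struct ggml_backend_dev_caps caps;
};

/* Option 1 sketch: when the user opts out (GGML_DISABLE_PINNED is a
 * hypothetical name), the backend reports host_buffer = false, so
 * upper layers never request pinned buffers and CPU-GPU staging uses
 * pageable memory instead. */
static void fill_dev_props(struct ggml_backend_dev_props *props) {
    props->name = "CUDA0";
    props->caps.async = true;
    props->caps.host_buffer = (getenv("GGML_DISABLE_PINNED") == NULL);
}

A later comment in this thread notes that GGML_CUDA_NO_PINNED already provides a comparable opt-out in the CUDA backend.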

Option 2 (More Complex, Ollama): Within Ollama's Go code, after obtaining device properties (ggml_backend_dev_get_props), it would be possible to inspect the value of props.caps.host_buffer. If it's true and the user has set a flag (e.g., OLLAMA_DISABLE_PINNED_MEMORY=true), Ollama could attempt to force the use of the CPU backend for layers that would normally go to the GPU (or a different buffer type that doesn't use pinned memory), even if a GPU is present. However, this would be a workaround and less elegant than Option 1, which would address the root cause at the GGML level.
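
The consumer side of Option 2, sketched here in C for brevity (Ollama would do the equivalent from Go through its ggml bindings), reusing the stand-in structures from the previous sketch:

#include <stdbool.h>
#include <stdlib.h>

/* Option 2 sketch: inspect the reported device capabilities and route
 * around pinned buffers when the user asks for it. The flag name
 * OLLAMA_DISABLE_PINNED_MEMORY is the one proposed above, not an
 * existing option. ggml_backend_dev_props is the stand-in struct from
 * the previous sketch. */
static bool should_avoid_pinned(const struct ggml_backend_dev_props *props) {
    return props->caps.host_buffer &&
           getenv("OLLAMA_DISABLE_PINNED_MEMORY") != NULL;
}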

Attachments:
link 4 user.txt: https://github.com/user-attachments/files/21093380/link.4.user.txt
log user 1.txt: https://github.com/user-attachments/files/21093379/log.user.1.txt
log user 2.txt: https://github.com/user-attachments/files/21093382/log.user.2.txt
log user 3.txt: https://github.com/user-attachments/files/21093381/log.user.3.txt


Relevant log output

llama_new_context_with_model: KV self size  = 1152.00 MiB, K (f16):  576.00 MiB, V (f16):  576.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
ggml_cuda_host_malloc: failed to allocate 2.51 MiB of pinned memory: hipErrorOutOfMemory
llama_new_context_with_model:      ROCm0 compute buffer size =    22.01 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     2.51 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2

llama_new_context_with_model: n_ctx      = 19456
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 1900.00 MiB of pinned memory: invalid argument
llama_kv_cache_init:        CPU KV buffer size =  1900.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1140.00 MiB
llama_new_context_with_model: KV self size  = 3040.00 MiB, K (f16): 1520.00 MiB, V (f16): 1520.00 MiB


llama_new_context_with_model: n_ctx      = 19456
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 1900.00 MiB of pinned memory: invalid argument
llama_kv_cache_init:        CPU KV buffer size =  1900.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1140.00 MiB
llama_new_context_with_model: KV self size  = 3040.00 MiB, K (f16): 1520.00 MiB, V (f16): 1520.00 MiB

OS

Windows

GPU

AMD

CPU

AMD

Ollama version

0.9.2

GiteaMirror added the bug and needs more info labels 2026-04-29 05:02:39 -05:00

@rick-github commented on GitHub (Jul 7, 2025):

A full log will improve debugging.


@d3f4ul7U53R commented on GitHub (Jul 7, 2025):

> A full log will improve debugging.

Logs are available in the .txt files above.


@rick-github commented on GitHub (Jul 7, 2025):

Only one of the .txt files is related to ollama, and it shows a successful completion.

A failure to allocate pinned memory is not fatal in and of itself. A pinned memory buffer is used to speed up data transfer between system memory and GPU memory, but ollama will fall back to using non-pinned memory if the alloc fails. What it does indicate is that the system has very little memory available for page-locking.

You can turn off memory pinning by setting GGML_CUDA_NO_PINNED=1 in the server environment. This will result in no more warning messages about pinned memory in the logs.

Generally, a BSOD or system freeze is caused by a problem with the system, not by an application. ollama will drive a system pretty hard, so if there's a borderline memory chip or a temperature-sensitive component, that can trigger a system-wide fault. Try running some hardware checks or system burn-in software to see whether the issue is independent of the applications being run.

You can also test whether the GPU is part of the problem by leaving it out of inference entirely, by setting num_gpu to zero.

$ ollama run gemma:2b
>>> /set parameter num_gpu 0
Set parameter 'num_gpu' to '0'
>>> hello
Hello! 👋 It's great to hear from you. What can I do for you today? 😊

>>> /bye 
$ ollama ps
NAME        ID              SIZE      PROCESSOR    UNTIL   
gemma:2b    b50d6c999e59    1.8 GB    100% CPU     Forever    

Reference: github-starred/ollama#53980