[GH-ISSUE #11317] Ollama/GGML: System Freezes, Crashes, and BSODs Due to ggml_host_malloc (Pinned Memory Allocation) Failures on GPUs (AMD/NVIDIA) #53980

Closed
opened 2026-04-29 05:02:38 -05:00 by GiteaMirror · 3 comments

Originally created by @d3f4ul7U53R on GitHub (Jul 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11317

What is the issue?

Problem Description:
The Ollama application frequently causes system-wide freezes, application crashes, and, in severe cases, Blue Screens of Death (BSODs) on both Windows and Linux when utilizing GPU acceleration (both AMD/ROCm and NVIDIA/CUDA). These occurrences consistently point to low-level memory allocation failures, specifically related to ggml_host_malloc (also known as "pinned memory").

Observed Behavior:

Symptoms: System-wide freezes, application crashes, and BSODs (on Windows).

Trigger: Problems are observed during model loading and, primarily, during active model inference/interaction (chatting). One user also reported a crash when attempting to stop the Ollama server (ollama stop), suggesting memory management or deallocation issues during shutdown.

Affected Models: The behavior has been observed with various models, including gemma:2b, llama3, Mistral-Nemo-Instruct-2407.Q5_K_M.gguf, and DeepSeek R1 Distill Qwen 7B.

Interactions: The issue tends to manifest after a variable number of interactions (e.g., "after two questions" for Llama3.2), often starting around the 5th interaction, depending on the model and system.

Technical Analysis and Suspected Cause:
Based on error logs (which consistently display messages such as ggml_cuda_host_malloc: failed to allocate X MiB of pinned memory: out of memory or invalid argument), the problem appears to be directly related to the allocation and management of host pinned memory by the GGML backend. This memory is crucial for efficient, high-speed data transfers between the CPU and GPU.

When the ggml_host_malloc function (or its specific variants like ggml_cuda_host_malloc or ggml_hip_host_malloc) fails—whether due to system limits, RAM fragmentation, or potential driver/implementation issues—it leads to severe system instability. As these are low-level memory allocation failures, they result in freezes and BSODs.
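
For illustration, the allocation pattern at issue looks roughly like the sketch below. This is a simplified rendering based on the log messages quoted later in this report, not the exact GGML source; host_malloc is a stand-in name.

#include <stddef.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Simplified sketch of a pinned (page-locked) host allocation. On
 * failure it returns NULL and the caller is expected to fall back to
 * ordinary pageable memory, which matches the "ROCm_Host" / "CPU"
 * buffer lines that follow the warnings in the logs below. */
static void *host_malloc(size_t size) {
    void *ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, size); /* page-locked allocation */
    if (err != cudaSuccess) {
        fprintf(stderr,
                "host_malloc: failed to allocate %zu bytes of pinned memory: %s\n",
                size, cudaGetErrorString(err));
        return NULL; /* caller falls back to pageable memory */
    }
    return ptr;
}

Pinned pages cannot be swapped out by the OS, so the amount a process can lock is bounded by available physical RAM and OS limits; this is why pinned allocations can fail even when plenty of virtual memory remains.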

The "large spike in GPU memory usage upon closing" observed by one user further suggests the possibility of memory leaks or inefficient deallocation of pinned memory, leading to its accumulation over time and subsequent failures, even after active use.

User 4:

OS: Windows

CPU: Intel Core i7 (quad-core)

RAM: 16GB (5GB used by OS)

GPU: NVIDIA GeForce GTX 1050 (4GB VRAM)

Ollama Version: Not specified in the report.

Model(s) Used: Gemma3:1b, Llama3.2

Symptoms: Complete computer freeze (with Gemma), BSODs after 2 questions (with Llama3.2), BSOD when trying to stop the server (ollama stop llama3.2), large spike in GPU memory usage upon closing.

External Discussion: "Ollama for Windows freezes or crashes Windows" (https://www.reddit.com/r/ollama/comments/1jfz86g/ollama_for_windows_freezes_or_crashes_windows/)

Proposed Mitigation / Potential Solution:
We believe a software solution can be implemented. We suggest adding a configuration option (e.g., a new environment variable or a llama.cpp flag) that allows users to disable the use of host pinned memory (i.e., ggml_cuda_host_malloc and ggml_hip_host_malloc) for GPU operations.

Explanation of Trade-off:
While this might result in slightly slower CPU-GPU transfers (as pageable memory would be used instead of pinned memory), we believe this approach would significantly improve system stability and make Ollama usable for many users currently experiencing critical crashes.

Potential Intervention Point (GGML/Ollama Developers):
The bool host_buffer field within the ggml_backend_dev_caps structure provides the most promising avenue for a software solution.

Option 1 (Ideal, llama.cpp/GGML upstream): Add a configuration option (via an environment variable or a backend initialization parameter) to force host_buffer to false for GPU backends. This would cause host memory allocations for CPU-GPU transfers to fall back to regular pageable memory, which, although slower, could circumvent the pinned memory exhaustion/freeze issues.

This would require a change within the C/C++ implementation of the GGML CUDA/ROCm backends to respect this flag and allocate "common" host memory instead of pinned memory.
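
As a minimal sketch of what Option 1 could look like, assuming the ggml_backend_dev_caps / ggml_backend_dev_props structures from ggml-backend.h (trimmed here so the snippet stands alone) and a hypothetical GGML_DISABLE_PINNED environment variable:

#include <stdbool.h>
#include <stdlib.h>

/* Stand-in declarations mirroring ggml-backend.h, simplified so this
 * sketch is self-contained; the real structs carry more fields. */
struct ggml_backend_dev_caps {
    bool async;
    bool host_buffer;
};

struct ggml_backend_dev_props {
    const char *name;
    struct ggml_backend_dev_caps caps;
};

/* Option 1 sketch: when the user opts out (GGML_DISABLE_PINNED is a
 * hypothetical name), the backend reports host_buffer = false, so
 * upper layers never request pinned buffers and CPU-GPU staging uses
 * pageable memory instead. */
static void fill_dev_props(struct ggml_backend_dev_props *props) {
    props->name = "CUDA0";
    props->caps.async = true;
    props->caps.host_buffer = (getenv("GGML_DISABLE_PINNED") == NULL);
}

A later comment in this thread notes that GGML_CUDA_NO_PINNED already provides a comparable opt-out in the CUDA backend.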

Option 2 (More Complex, Ollama): Within Ollama's Go code, after obtaining device properties (ggml_backend_dev_get_props), it would be possible to inspect the value of props.caps.host_buffer. If it's true and the user has set a flag (e.g., OLLAMA_DISABLE_PINNED_MEMORY=true), Ollama could attempt to force the use of the CPU backend for layers that would normally go to the GPU (or a different buffer type that doesn't use pinned memory), even if a GPU is present. However, this would be a workaround and less elegant than Option 1, which would address the root cause at the GGML level.
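
The consumer side of Option 2, sketched here in C for brevity (Ollama would do the equivalent from Go through its ggml bindings), reusing the stand-in structures from the previous sketch:

#include <stdbool.h>
#include <stdlib.h>

/* Option 2 sketch: inspect the reported device capabilities and route
 * around pinned buffers when the user asks for it. The flag name
 * OLLAMA_DISABLE_PINNED_MEMORY is the one proposed above, not an
 * existing option. ggml_backend_dev_props is the stand-in struct from
 * the previous sketch. */
static bool should_avoid_pinned(const struct ggml_backend_dev_props *props) {
    return props->caps.host_buffer &&
           getenv("OLLAMA_DISABLE_PINNED_MEMORY") != NULL;
}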

Attachments:
link 4 user.txt: https://github.com/user-attachments/files/21093380/link.4.user.txt
log user 1.txt: https://github.com/user-attachments/files/21093379/log.user.1.txt
log user 2.txt: https://github.com/user-attachments/files/21093382/log.user.2.txt
log user 3.txt: https://github.com/user-attachments/files/21093381/log.user.3.txt


Relevant log output

llama_new_context_with_model: KV self size  = 1152.00 MiB, K (f16):  576.00 MiB, V (f16):  576.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
ggml_cuda_host_malloc: failed to allocate 2.51 MiB of pinned memory: hipErrorOutOfMemory
llama_new_context_with_model:      ROCm0 compute buffer size =    22.01 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     2.51 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2

llama_new_context_with_model: n_ctx      = 19456
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 1900.00 MiB of pinned memory: invalid argument
llama_kv_cache_init:        CPU KV buffer size =  1900.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1140.00 MiB
llama_new_context_with_model: KV self size  = 3040.00 MiB, K (f16): 1520.00 MiB, V (f16): 1520.00 MiB


llama_new_context_with_model: n_ctx      = 19456
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 1900.00 MiB of pinned memory: invalid argument
llama_kv_cache_init:        CPU KV buffer size =  1900.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  1140.00 MiB
llama_new_context_with_model: KV self size  = 3040.00 MiB, K (f16): 1520.00 MiB, V (f16): 1520.00 MiB

OS

Windows

GPU

AMD

CPU

AMD

Ollama version

0.9.2

GiteaMirror added the bug and needs more info labels 2026-04-29 05:02:39 -05:00

@rick-github commented on GitHub (Jul 7, 2025):

A full log will improve debugging.


@d3f4ul7U53R commented on GitHub (Jul 7, 2025):

> A full log will improve debugging.

Logs are available in the .txt files above.


@rick-github commented on GitHub (Jul 7, 2025):

Only one of the .txt files is related to ollama, and it shows a successful completion.

A failure to allocate pinned memory is not fatal in and of itself. A pinned memory buffer is used to speed up data transfer between system memory and GPU memory, but ollama will fall back to using non-pinned memory if the alloc fails. What it does indicate is that the system has very little memory available for page-locking.

You can turn off memory pinning by setting GGML_CUDA_NO_PINNED=1 in the server environment. This will result in no more warning messages about pinned memory in the logs.

Generally, a BSOD or system freeze is caused by a problem with the system, not by an application. ollama will drive a system pretty hard, so if there's a borderline memory chip or a temperature-sensitive component, that can trigger a system-wide fault. Try running some hardware checks or system burn-in software to see whether the issue is independent of the applications being run.

You can also test whether the GPU is part of the problem by leaving it out of inference entirely, by setting num_gpu to zero.

$ ollama run gemma:2b
>>> /set parameter num_gpu 0
Set parameter 'num_gpu' to '0'
>>> hello
Hello! 👋 It's great to hear from you. What can I do for you today? 😊

>>> /bye 
$ ollama ps
NAME        ID              SIZE      PROCESSOR    UNTIL   
gemma:2b    b50d6c999e59    1.8 GB    100% CPU     Forever    

Reference: github-starred/ollama#53980