[GH-ISSUE #7440] [v0.4.0-rc6] CUDA OOM using x/llama3.2-vision:11b-instruct #51240

Closed
opened 2026-04-28 18:58:17 -05:00 by GiteaMirror · 10 comments

Originally created by @thatjpk on GitHub (Oct 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7440

Originally assigned to: @mxyng on GitHub.

What is the issue?

Attached log: llama3.2-cuda-oom.log (https://github.com/user-attachments/files/17582524/llama3.2-cuda-oom.log)

I'm testing the x/llama3.2-vision:11b-instruct-q4_K_M and x/llama3.2-vision:11b-instruct-q8_0 models from ollama.com, using ollama 0.4.0-rc6 via Open WebUI v0.3.35 (in docker).

~ docker ps
CONTAINER ID   IMAGE                                COMMAND               CREATED          STATUS                    PORTS                                       NAMES
4c149404563a   ghcr.io/open-webui/open-webui:main   "bash start.sh"       14 minutes ago   Up 14 minutes (healthy)   0.0.0.0:3000->8080/tcp, :::3000->8080/tcp   open-webui
c4d43daa9ad5   ollama/ollama:0.4.0-rc6              "/bin/ollama serve"   14 minutes ago   Up 14 minutes             11434/tcp                                   ollama
~ docker --version
Docker version 24.0.7, build 24.0.7-0ubuntu2~22.04.1
~ nvidia-smi
Thu Oct 31 01:28:17 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     Off |   00000000:0B:00.0  On |                  N/A |
|  0%   32C    P5             68W /  366W |    1990MiB /  12288MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

When ollama is running with CUDA enabled and I post an image in a chat with a llama3.2-vision model, Open WebUI reports "Oops! No text generated from Ollama, Please try again.", and ollama generates the attached log. A snippet of the log around the SIGSEGV is this:

  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.36 MiB
llm_load_tensors: offloading 31 repeating layers to GPU
llm_load_tensors: offloaded 31/41 layers to GPU
llm_load_tensors:        CPU buffer size =  5679.33 MiB
llm_load_tensors:      CUDA0 buffer size =  3841.45 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   156.06 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   500.19 MiB
llama_new_context_with_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 95
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: vision using CUDA backend
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2853.34 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 2991947904
mllama_model_load: compute allocated memory: 0.00 MB
time=2024-10-31T05:39:41.603Z level=INFO source=server.go:606 msg="llama runner started in 2.26 seconds"
SIGSEGV: segmentation violation
PC=0x634314838794 m=7 sigcode=1 addr=0x10
signal arrived during cgo execution

goroutine 18 gp=0xc000218000 m=7 mp=0xc000100808 [syscall]:
runtime.cgocall(0x634314832920, 0xc00002b360)
        runtime/cgocall.go:157 +0x4b fp=0xc00002b338 sp=0xc00002b300 pc=0x6343145b53ab
github.com/ollama/ollama/llama._Cfunc_mllama_image_encode(0x78d73983e760, 0x10, 0x78d73c000ce0, 0xc0050ea000)
        _cgo_gotypes.go:915 +0x4c fp=0xc00002b360 sp=0xc00002b338 pc=0x6343146b3d4c
github.com/ollama/ollama/llama.(*MllamaContext).NewEmbed.func3(0xc000014300?, 0xc000202130?, 0x78d73c000ce0, {0xc0050ea000, 0xc00002b400?, 0x6343146b949f?})
        github.com/ollama/ollama/llama/llama.go:541 +0xa8 fp=0xc00002b3b8 sp=0xc00002b360 pc=0x6343146b7f48
github.com/ollama/ollama/llama.(*MllamaContext).NewEmbed(0xc000014300, 0xc000202130, {0xc00428e000, 0xe5b000, 0xe5b000}, 0x6)
        github.com/ollama/ollama/llama/llama.go:541 +0x111 fp=0xc00002b448 sp=0xc00002b3b8 pc=0x6343146b7db1
main.(*ImageContext).NewEmbed(0xc0000d0dd0, 0xc000202130, {0xc00428e000, 0xe5b000, 0xe5b000}, 0x6)
        github.com/ollama/ollama/llama/runner/image.go:78 +0x1a7 fp=0xc00002b4e0 sp=0xc00002b448 pc=0x63431482ad47
main.(*Server).inputs(0xc0000ea120, {0xc0001c8000, 0x86}, {0xc0000cf050, 0x1, 0x146138a5?})
        github.com/ollama/ollama/llama/runner/runner.go:193 +0x28e fp=0xc00002b600 sp=0xc00002b4e0 pc=0x63431482c2ee
main.(*Server).NewSequence(0xc0000ea120, {0xc0001c8000, 0x86}, {0xc0000cf050, 0x1, 0x1}, {0x5000, {0x0, 0x0, 0x0}, ...})
        github.com/ollama/ollama/llama/runner/runner.go:100 +0xb2 fp=0xc00002b7b8 sp=0xc00002b600 pc=0x63431482b8b2
main.(*Server).completion(0xc0000ea120, {0x634314b6acf0, 0xc0002342a0}, 0xc0002226c0)
        github.com/ollama/ollama/llama/runner/runner.go:591 +0x52a fp=0xc00002bab8 sp=0xc00002b7b8 pc=0x63431482e7ca
main.(*Server).completion-fm({0x634314b6acf0?, 0xc0002342a0?}, 0x63431480a32d?)
        <autogenerated>:1 +0x36 fp=0xc00002bae8 sp=0xc00002bab8 pc=0x634314831b96
net/http.HandlerFunc.ServeHTTP(0xc0000d0c30?, {0x634314b6acf0?, 0xc0002342a0?}, 0x10?)
        net/http/server.go:2171 +0x29 fp=0xc00002bb10 sp=0xc00002bae8 pc=0x634314802dc9
net/http.(*ServeMux).ServeHTTP(0x6343145bef65?, {0x634314b6acf0, 0xc0002342a0}, 0xc0002226c0)
        net/http/server.go:2688 +0x1ad fp=0xc00002bb60 sp=0xc00002bb10 pc=0x634314804c4d
net/http.serverHandler.ServeHTTP({0x634314b6a040?}, {0x634314b6acf0?, 0xc0002342a0?}, 0x6?)
        net/http/server.go:3142 +0x8e fp=0xc00002bb90 sp=0xc00002bb60 pc=0x634314805c6e
net/http.(*conn).serve(0xc000212000, {0x634314b6b148, 0xc0000cedb0})
        net/http/server.go:2044 +0x5e8 fp=0xc00002bfb8 sp=0xc00002bb90 pc=0x634314801a08
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3290 +0x28 fp=0xc00002bfe0 sp=0xc00002bfb8 pc=0x6343148063e8
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc00002bfe8 sp=0xc00002bfe0 pc=0x63431461ddc1
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3290 +0x4b4

Some additional notes:

  • I see ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2853.34 MiB on device 0: cudaMalloc failed: out of memory in there, which doesn't add up to me because this GPU has 12GB of VRAM (about 10GB of which is usable as it's also running the KDE session).
  • This happens on both the q4_K_M and q8_0 quants of the model.
  • This doesn't happen when I run without CUDA. The model runs on the CPU and works, albeit slowly.
  • Older vision models in this setup, like llava-llama3, work as they always have with or without CUDA.

All that said, I recognize this may be something to do with my setup. So if you have additional troubleshooting steps I can do to better isolate the behavior, please let me know. Thanks for taking a look!

OS

Linux, Docker

GPU

Nvidia

CPU

AMD

Ollama version

v0.4.0-rc6

GiteaMirror added the memory, bug labels 2026-04-28 18:58:18 -05:00

@rick-github commented on GitHub (Oct 31, 2024):

time=2024-10-31T05:39:39.345Z level=INFO source=memory.go:346 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=31 layers.split="" memory.available="[9.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.0 GiB" memory.required.partial="9.4 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[9.4 GiB]" memory.weights.total="5.2 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="213.3 MiB" memory.graph.partial="213.3 MiB"

You might have 10G available on your GPU, but the model requires 11G, and so only 31 of 41 layers are allocated to the GPU, the rest will be loaded into system memory. This means that the GPU is fully allocated, and because the architecture of this model is new, it's possible that the memory calculations aren't quite correct and ollama is telling the runner to load more layers than it can handle.

If this is the problem, there are a few things you can do to try to mitigate the issue.

  1. Reduce the number of layers offloaded to the GPU by explicitly setting num_gpu (see https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650 for details, and the sketch after this list). You can find the current value by searching for layers.model in the logs. For this model it's 31, so try 28 and, if that works, increase it until you get a failure.
  2. Set OLLAMA_FLASH_ATTENTION=1 in the server environment. Flash attention is a more efficient use of KV space and may reduce memory pressure, although its effectiveness varies by model.
  3. Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 in the server environment. This will allow the runner to use system RAM if there's insufficient VRAM. This can be a bottleneck and make the model run slower, so it's not great for allocating large amounts of RAM, but if it's just enough to prevent the server from OOMing on VRAM, the performance hit is not likely to be noticeable.
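
A minimal sketch of setting num_gpu, assuming the API is reachable on the default port 11434 and using the model tag from this issue (the request body, prompt, and derived-model name are illustrative, not taken from the original report):

```
# Hypothetical request: cap GPU offload at 28 layers via the per-request "options" field.
curl http://localhost:11434/api/generate -d '{
  "model": "x/llama3.2-vision:11b-instruct-q4_K_M",
  "prompt": "Describe this image.",
  "options": { "num_gpu": 28 }
}'

# Alternatively, bake the parameter into a derived model with a Modelfile:
#   FROM x/llama3.2-vision:11b-instruct-q4_K_M
#   PARAMETER num_gpu 28
# then: ollama create llama3.2-vision-gpu28 -f Modelfile
```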

@rick-github commented on GitHub (Nov 1, 2024):

https://github.com/ollama/ollama/pull/7456


@thatjpk commented on GitHub (Nov 1, 2024):

time=2024-10-31T05:39:39.345Z level=INFO source=memory.go:346 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=31 layers.split="" memory.available="[9.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.0 GiB" memory.required.partial="9.4 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[9.4 GiB]" memory.weights.total="5.2 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="213.3 MiB" memory.graph.partial="213.3 MiB"

You might have 10G available on your GPU, but the model requires 11G, and so only 31 of 41 layers are allocated to the GPU, the rest will be loaded into system memory. This means that the GPU is fully allocated, and because the architecture of this model is new, it's possible that the memory calculations aren't quite correct and ollama is telling the runner to load more layers than it can handle.

Ah, yeah this makes sense. I did the tests you suggested below, and it looks like you're probably right about the layer offload math just being off.

If this is the problem, there are a few things you can do to try to mitigate the issue.

  1. Reduce the number of layers offloaded to the GPU by explicitly setting num_gpu. See here for details. You can find the current value by searching for layers.model in the logs. For this model it's 31, so try 28 and if that works, increase it until you get a failure.

Trying x/llama3.2-vision:11b-instruct-q4_K_M again with num_gpu set to 27 or lower worked; anything higher reproduces the crash. Available VRAM fluctuated, I guess because of other stuff going on in the desktop session, so 27 was the magic number here instead of 28.

  2. Set OLLAMA_FLASH_ATTENTION=1 in the server environment. Flash attention is a more efficient use of KV space and may reduce memory pressure, although its effectiveness varies by model.

Tried this also on the q4 model (and verified the env var was set by seeing OLLAMA_FLASH_ATTENTION:true in the log), but still got the crash when num_gpu was higher than 27. So, not enough to avoid the crash, at least in this case.

  3. Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 in the server environment. This will allow the runner to use system RAM if there's insufficient VRAM. This can be a bottleneck and make the model run slower, so it's not great for allocating large amounts of RAM, but if it's just enough to prevent the server from OOMing on VRAM, the performance hit is not likely to be noticeable.

Tried this on the q4 model, and it worked! Even with num_gpu left at the default (which from the logs looks like it was 30 this time around), setting the unified memory variable avoided the crash.
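
For reference, a hedged sketch of passing these environment variables to an ollama container similar to the one in the original report (container name, volume path, and image tag are assumptions to adapt to your deployment):

```
# Recreate the ollama container with the env vars set (requires the NVIDIA container toolkit).
docker rm -f ollama
docker run -d --gpus=all \
  -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama:0.4.0-rc6
```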


@jessegross commented on GitHub (Nov 5, 2024):

For those who are running into this issue, it should now be fixed in RC8.


@thatjpk commented on GitHub (Nov 5, 2024):

Re-tested rc8 without an explicit num_gpu or any of the other mitigations mentioned above, and I haven't been able to repro the crash. Thanks for the fix!

Closing this, but feel free to reopen if needed.


@mastoca commented on GitHub (Nov 8, 2024):

Still seeing this in 0.4.1-rc0.

No matter what settings I use, the GPU is never used.

This looks kinda suspicious:

time=2024-11-08T18:11:03.922-05:00 level=INFO source=server.go:601 msg="llama runner started in 1.45 seconds"
[GIN] 2024/11/08 - 18:11:03 | 200 | 1.721373066s | 127.0.0.1 | POST "/api/generate"
time=2024-11-08T18:11:52.342-05:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"

I have ~20GB of VRAM on my GPU (checked and watched in nvtop; no ollama process ever showed up).
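
A quick sketch of cross-checking where the model actually loaded, using standard commands rather than anything specific to this report:

```
# Ask ollama where the loaded model ended up; the PROCESSOR column shows the
# CPU/GPU split (e.g. "100% GPU").
ollama ps

# Cross-check VRAM usage while a request is running.
nvidia-smi
```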


@rick-github commented on GitHub (Nov 8, 2024):

Supply full logs. The warning is just a warning; it enforces OLLAMA_NUM_PARALLEL=1.
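
A sketch of capturing the full server log, assuming either a Docker install or the systemd service set up by the Linux install script (container and unit names may differ per setup):

```
# Docker install: dump the container's log.
docker logs ollama > ollama-server.log 2>&1

# systemd install: capture the service journal.
journalctl -u ollama --no-pager > ollama-server.log
```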


@mastoca commented on GitHub (Nov 9, 2024):

Here's the output (slightly cleaned):

2024/11/08 19:39:16 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/mastoca/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-11-08T19:39:16.790-05:00 level=INFO source=images.go:755 msg="total blobs: 225"
time=2024-11-08T19:39:16.793-05:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
time=2024-11-08T19:39:16.795-05:00 level=INFO source=routes.go:1240 msg="Listening on [::]:11434 (version 0.4.1-rc0)"
time=2024-11-08T19:39:16.796-05:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1805014448/runners
time=2024-11-08T19:39:16.838-05:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2]"
time=2024-11-08T19:39:16.838-05:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-08T19:39:16.838-05:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-11-08T19:39:16.839-05:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-11-08T19:39:16.839-05:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-11-08T19:39:16.970-05:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-11-08T19:39:16.970-05:00 level=INFO source=amd_linux.go:296 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB"
time=2024-11-08T19:39:16.970-05:00 level=INFO source=amd_linux.go:399 msg="no compatible amdgpu devices detected"
time=2024-11-08T19:39:16.970-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-<redacted> library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4090" total="22.1 GiB" available="20.8 GiB"
[GIN] 2024/11/08 - 19:39:45 | 200 | 144.552µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/11/08 - 19:39:45 | 200 | 23.576778ms | 127.0.0.1 | POST "/api/show"
time=2024-11-08T19:39:45.576-05:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2024-11-08T19:39:45.721-05:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/home/mastoca/.ollama/models/blobs/sha256-7ef0839fb71fbab13fda97c1b9819ffd99c799ba4f93d421ae1e2a46d68c5fa6 gpu=GPU-<redacted> parallel=1 available=22295216128 required="15.3 GiB"
time=2024-11-08T19:39:45.836-05:00 level=INFO source=server.go:105 msg="system memory" total="61.9 GiB" free="39.1 GiB" free_swap="0 B"
time=2024-11-08T19:39:45.837-05:00 level=INFO source=memory.go:343 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[20.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="15.3 GiB" memory.required.partial="15.3 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[15.3 GiB]" memory.weights.total="9.3 GiB" memory.weights.repeating="8.8 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2024-11-08T19:39:45.838-05:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama1805014448/runners/cpu_avx2/ollama_llama_server --model /home/mastoca/.ollama/models/blobs/sha256-7ef0839fb71fbab13fda97c1b9819ffd99c799ba4f93d421ae1e2a46d68c5fa6 --ctx-size 2048 --batch-size 512 --n-gpu-layers 41 --mmproj /home/mastoca/.ollama/models/blobs/sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 16 --parallel 1 --port 32953"
time=2024-11-08T19:39:45.838-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-11-08T19:39:45.838-05:00 level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
time=2024-11-08T19:39:45.839-05:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
time=2024-11-08T19:39:45.844-05:00 level=INFO source=runner.go:863 msg="starting go runner"
time=2024-11-08T19:39:45.844-05:00 level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=16
time=2024-11-08T19:39:45.844-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:32953"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from /home/mastoca/.ollama/models/blobs/sha256-7ef0839fb71fbab13fda97c1b9819ffd99c799ba4f93d421ae1e2a46d68c5fa6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 7
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q8_0: 282 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 9.67 GiB (8.50 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: ggml ctx size = 0.18 MiB
time=2024-11-08T19:39:46.290-05:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server not responding"
llm_load_tensors: CPU buffer size = 9905.93 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-11-08T19:39:46.541-05:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2024-11-08T19:39:47.545-05:00 level=INFO source=server.go:601 msg="llama runner started in 1.71 seconds"
[GIN] 2024/11/08 - 19:39:47 | 200 | 1.980357826s | 127.0.0.1 | POST "/api/generate"
time=2024-11-08T19:40:13.747-05:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2024/11/08 - 19:41:04 | 200 | 50.550516492s | 127.0.0.1 | POST "/api/chat"


@rick-github commented on GitHub (Nov 9, 2024):

How did you install ollama? It looks like you have missing libraries, which might indicate a packaging problem. I'm able to run 0.4.1-rc0 in docker without a problem, so this is perhaps specific to the package you installed. You could also try installing the actual release, 0.4.1 (https://github.com/ollama/ollama/releases/tag/v0.4.1), and see if it works better.
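
A sketch of trying the released build in Docker with GPU access, assuming the published image tags follow the release version (as with 0.4.0-rc6 earlier in this issue) and that the NVIDIA container toolkit is installed; volume and port flags are illustrative:

```
docker pull ollama/ollama:0.4.1
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:0.4.1

# The startup log should list a CUDA runner and the GPU under "inference compute"
# rather than only cpu/cpu_avx/cpu_avx2.
docker logs ollama
```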


@mastoca commented on GitHub (Nov 9, 2024):

@rick-github I think you're right. I'm using a nix flake (nixos unstable branch) that I'm trying to upgrade to 0.4.1. I suspect it isn't linking the CUDA portion correctly into both binaries (bin/runner and bin/ollama). I'll build the docker image and see how it behaves next.

I'm not a Go developer, so I'm not yet comfortable with the Go toolchain.

Reference: github-starred/ollama#51240