[GH-ISSUE #7440] [v0.4.0-rc6] CUDA OOM using x/llama3.2-vision:11b-instruct #51240

Closed
opened 2026-04-28 18:58:17 -05:00 by GiteaMirror · 10 comments

Originally created by @thatjpk on GitHub (Oct 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7440

Originally assigned to: @mxyng on GitHub.

What is the issue?

Attached log: llama3.2-cuda-oom.log (https://github.com/user-attachments/files/17582524/llama3.2-cuda-oom.log)

I'm testing the x/llama3.2-vision:11b-instruct-q4_K_M and x/llama3.2-vision:11b-instruct-q8_0 models from ollama.com, using ollama 0.4.0-rc6 via Open WebUI v0.3.35 (in docker).

~ docker ps
CONTAINER ID   IMAGE                                COMMAND               CREATED          STATUS                    PORTS                                       NAMES
4c149404563a   ghcr.io/open-webui/open-webui:main   "bash start.sh"       14 minutes ago   Up 14 minutes (healthy)   0.0.0.0:3000->8080/tcp, :::3000->8080/tcp   open-webui
c4d43daa9ad5   ollama/ollama:0.4.0-rc6              "/bin/ollama serve"   14 minutes ago   Up 14 minutes             11434/tcp                                   ollama
~ docker --version
Docker version 24.0.7, build 24.0.7-0ubuntu2~22.04.1
~ nvidia-smi
Thu Oct 31 01:28:17 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     Off |   00000000:0B:00.0  On |                  N/A |
|  0%   32C    P5             68W /  366W |    1990MiB /  12288MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

When ollama is running with CUDA enabled and I post an image in a chat with a llama3.2-vision model, Open WebUI reports "Oops! No text generated from Ollama, Please try again.", and ollama generates the attached log. A snippet of the log around the SIGSEGV is this:

  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.36 MiB
llm_load_tensors: offloading 31 repeating layers to GPU
llm_load_tensors: offloaded 31/41 layers to GPU
llm_load_tensors:        CPU buffer size =  5679.33 MiB
llm_load_tensors:      CUDA0 buffer size =  3841.45 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   156.06 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   500.19 MiB
llama_new_context_with_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 95
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: vision using CUDA backend
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2853.34 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 2991947904
mllama_model_load: compute allocated memory: 0.00 MB
time=2024-10-31T05:39:41.603Z level=INFO source=server.go:606 msg="llama runner started in 2.26 seconds"
SIGSEGV: segmentation violation
PC=0x634314838794 m=7 sigcode=1 addr=0x10
signal arrived during cgo execution

goroutine 18 gp=0xc000218000 m=7 mp=0xc000100808 [syscall]:
runtime.cgocall(0x634314832920, 0xc00002b360)
        runtime/cgocall.go:157 +0x4b fp=0xc00002b338 sp=0xc00002b300 pc=0x6343145b53ab
github.com/ollama/ollama/llama._Cfunc_mllama_image_encode(0x78d73983e760, 0x10, 0x78d73c000ce0, 0xc0050ea000)
        _cgo_gotypes.go:915 +0x4c fp=0xc00002b360 sp=0xc00002b338 pc=0x6343146b3d4c
github.com/ollama/ollama/llama.(*MllamaContext).NewEmbed.func3(0xc000014300?, 0xc000202130?, 0x78d73c000ce0, {0xc0050ea000, 0xc00002b400?, 0x6343146b949f?})
        github.com/ollama/ollama/llama/llama.go:541 +0xa8 fp=0xc00002b3b8 sp=0xc00002b360 pc=0x6343146b7f48
github.com/ollama/ollama/llama.(*MllamaContext).NewEmbed(0xc000014300, 0xc000202130, {0xc00428e000, 0xe5b000, 0xe5b000}, 0x6)
        github.com/ollama/ollama/llama/llama.go:541 +0x111 fp=0xc00002b448 sp=0xc00002b3b8 pc=0x6343146b7db1
main.(*ImageContext).NewEmbed(0xc0000d0dd0, 0xc000202130, {0xc00428e000, 0xe5b000, 0xe5b000}, 0x6)
        github.com/ollama/ollama/llama/runner/image.go:78 +0x1a7 fp=0xc00002b4e0 sp=0xc00002b448 pc=0x63431482ad47
main.(*Server).inputs(0xc0000ea120, {0xc0001c8000, 0x86}, {0xc0000cf050, 0x1, 0x146138a5?})
        github.com/ollama/ollama/llama/runner/runner.go:193 +0x28e fp=0xc00002b600 sp=0xc00002b4e0 pc=0x63431482c2ee
main.(*Server).NewSequence(0xc0000ea120, {0xc0001c8000, 0x86}, {0xc0000cf050, 0x1, 0x1}, {0x5000, {0x0, 0x0, 0x0}, ...})
        github.com/ollama/ollama/llama/runner/runner.go:100 +0xb2 fp=0xc00002b7b8 sp=0xc00002b600 pc=0x63431482b8b2
main.(*Server).completion(0xc0000ea120, {0x634314b6acf0, 0xc0002342a0}, 0xc0002226c0)
        github.com/ollama/ollama/llama/runner/runner.go:591 +0x52a fp=0xc00002bab8 sp=0xc00002b7b8 pc=0x63431482e7ca
main.(*Server).completion-fm({0x634314b6acf0?, 0xc0002342a0?}, 0x63431480a32d?)
        <autogenerated>:1 +0x36 fp=0xc00002bae8 sp=0xc00002bab8 pc=0x634314831b96
net/http.HandlerFunc.ServeHTTP(0xc0000d0c30?, {0x634314b6acf0?, 0xc0002342a0?}, 0x10?)
        net/http/server.go:2171 +0x29 fp=0xc00002bb10 sp=0xc00002bae8 pc=0x634314802dc9
net/http.(*ServeMux).ServeHTTP(0x6343145bef65?, {0x634314b6acf0, 0xc0002342a0}, 0xc0002226c0)
        net/http/server.go:2688 +0x1ad fp=0xc00002bb60 sp=0xc00002bb10 pc=0x634314804c4d
net/http.serverHandler.ServeHTTP({0x634314b6a040?}, {0x634314b6acf0?, 0xc0002342a0?}, 0x6?)
        net/http/server.go:3142 +0x8e fp=0xc00002bb90 sp=0xc00002bb60 pc=0x634314805c6e
net/http.(*conn).serve(0xc000212000, {0x634314b6b148, 0xc0000cedb0})
        net/http/server.go:2044 +0x5e8 fp=0xc00002bfb8 sp=0xc00002bb90 pc=0x634314801a08
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3290 +0x28 fp=0xc00002bfe0 sp=0xc00002bfb8 pc=0x6343148063e8
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc00002bfe8 sp=0xc00002bfe0 pc=0x63431461ddc1
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3290 +0x4b4

Some additional notes:

  • I see ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2853.34 MiB on device 0: cudaMalloc failed: out of memory in there, which doesn't add up to me because this GPU has 12GB of VRAM (about 10GB of which is usable as it's also running the KDE session).
  • This happens on both the q4_K_M and q8_0 quants of the model.
  • This doesn't happen when I run without CUDA. The model runs on the CPU and works, albeit slowly.
  • Older vision models in this setup, like llava-llama3, work as they always have with or without CUDA.

All that said, I recognize this may be something to do with my setup. So if you have additional troubleshooting steps I can do to better isolate the behavior, please let me know. Thanks for taking a look!

OS

Linux, Docker

GPU

Nvidia

CPU

AMD

Ollama version

v0.4.0-rc6

GiteaMirror added the memory, bug labels 2026-04-28 18:58:18 -05:00

@rick-github commented on GitHub (Oct 31, 2024):

time=2024-10-31T05:39:39.345Z level=INFO source=memory.go:346 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=31 layers.split="" memory.available="[9.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.0 GiB" memory.required.partial="9.4 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[9.4 GiB]" memory.weights.total="5.2 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="213.3 MiB" memory.graph.partial="213.3 MiB"

You might have 10G available on your GPU, but the model requires 11G, and so only 31 of 41 layers are allocated to the GPU, the rest will be loaded into system memory. This means that the GPU is fully allocated, and because the architecture of this model is new, it's possible that the memory calculations aren't quite correct and ollama is telling the runner to load more layers than it can handle.

If this is the problem, there are a few things you can do to try to mitigate the issue.

  1. Reduce the number of layers offloaded to the GPU by explicitly setting num_gpu (see https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650 for details, and the sketch after this list). You can find the current value by searching for layers.model in the logs. For this model it's 31, so try 28 and, if that works, increase it until you get a failure.
  2. Set OLLAMA_FLASH_ATTENTION=1 in the server environment. Flash attention is a more efficient use of KV space and may reduce memory pressure, although its effectiveness varies by model.
  3. Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 in the server environment. This will allow the runner to use system RAM if there's insufficient VRAM. This can be a bottleneck and make the model run slower, so it's not great for allocating large amounts of RAM, but if it's just enough to prevent the server from OOMing on VRAM, the performance hit is not likely to be noticeable.
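
A minimal sketch of setting num_gpu, assuming the API is reachable on the default port 11434 and using the model tag from this issue (the request body, prompt, and derived-model name are illustrative, not taken from the original report):

```
# Hypothetical request: cap GPU offload at 28 layers via the per-request "options" field.
curl http://localhost:11434/api/generate -d '{
  "model": "x/llama3.2-vision:11b-instruct-q4_K_M",
  "prompt": "Describe this image.",
  "options": { "num_gpu": 28 }
}'

# Alternatively, bake the parameter into a derived model with a Modelfile:
#   FROM x/llama3.2-vision:11b-instruct-q4_K_M
#   PARAMETER num_gpu 28
# then: ollama create llama3.2-vision-gpu28 -f Modelfile
```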

@rick-github commented on GitHub (Nov 1, 2024):

https://github.com/ollama/ollama/pull/7456


@thatjpk commented on GitHub (Nov 1, 2024):

time=2024-10-31T05:39:39.345Z level=INFO source=memory.go:346 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=31 layers.split="" memory.available="[9.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.0 GiB" memory.required.partial="9.4 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[9.4 GiB]" memory.weights.total="5.2 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="213.3 MiB" memory.graph.partial="213.3 MiB"

You might have 10G available on your GPU, but the model requires 11G, and so only 31 of 41 layers are allocated to the GPU, the rest will be loaded into system memory. This means that the GPU is fully allocated, and because the architecture of this model is new, it's possible that the memory calculations aren't quite correct and ollama is telling the runner to load more layers than it can handle.

Ah, yeah this makes sense. I did the tests you suggested below, and it looks like you're probably right about the layer offload math just being off.

If this is the problem, there are a few things you can do to try to mitigate the issue.

  1. Reduce the number of layers offloaded to the GPU by explicitly setting num_gpu. See here for details. You can find the current value by searching for layers.model in the logs. For this model it's 31, so try 28 and if that works, increase it until you get a failure.

Trying x/llama3.2-vision:11b-instruct-q4_K_M again with num_gpu set to 27 or lower worked; anything higher reproduces the crash. Available VRAM fluctuated, I guess because of other stuff going on in the desktop session, so 27 was the magic number here instead of 28.

  2. Set OLLAMA_FLASH_ATTENTION=1 in the server environment. Flash attention is a more efficient use of KV space and may reduce memory pressure, although its effectiveness varies by model.

Tried this also on the q4 model (and verified the env var was set by seeing OLLAMA_FLASH_ATTENTION:true in the log), but still got the crash when num_gpu was higher than 27. So, not enough to avoid the crash, at least in this case.

  3. Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 in the server environment. This will allow the runner to use system RAM if there's insufficient VRAM. This can be a bottleneck and make the model run slower, so it's not great for allocating large amounts of RAM, but if it's just enough to prevent the server from OOMing on VRAM, the performance hit is not likely to be noticeable.

Tried this on the q4 model, and it worked! Even with num_gpu left at the default (which from the logs looks like it was 30 this time around), setting the unified memory variable avoided the crash.
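
For reference, a hedged sketch of passing these environment variables to an ollama container similar to the one in the original report (container name, volume path, and image tag are assumptions to adapt to your deployment):

```
# Recreate the ollama container with the env vars set (requires the NVIDIA container toolkit).
docker rm -f ollama
docker run -d --gpus=all \
  -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama:0.4.0-rc6
```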


@jessegross commented on GitHub (Nov 5, 2024):

For those who are running into this issue, it should now be fixed in RC8.


@thatjpk commented on GitHub (Nov 5, 2024):

Re-tested rc8 without an explicit num_gpu or any of the other mitigations mentioned above, and I haven't been able to repro the crash. Thanks for the fix!

Closing this, but feel free to reopen if needed.


@mastoca commented on GitHub (Nov 8, 2024):

Still seeing this in 0.4.1-rc0.

No matter what settings I use, the GPU is never used.

This looks kinda suspicious:

time=2024-11-08T18:11:03.922-05:00 level=INFO source=server.go:601 msg="llama runner started in 1.45 seconds"
[GIN] 2024/11/08 - 18:11:03 | 200 | 1.721373066s | 127.0.0.1 | POST "/api/generate"
time=2024-11-08T18:11:52.342-05:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"

I have ~20GB of VRAM on my GPU (checked and watched in nvtop; no ollama process ever showed up).
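
A quick sketch of cross-checking where the model actually loaded, using standard commands rather than anything specific to this report:

```
# Ask ollama where the loaded model ended up; the PROCESSOR column shows the
# CPU/GPU split (e.g. "100% GPU").
ollama ps

# Cross-check VRAM usage while a request is running.
nvidia-smi
```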


@rick-github commented on GitHub (Nov 8, 2024):

Supply full logs. The warning is just a warning; it enforces OLLAMA_NUM_PARALLEL=1.
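
A sketch of capturing the full server log, assuming either a Docker install or the systemd service set up by the Linux install script (container and unit names may differ per setup):

```
# Docker install: dump the container's log.
docker logs ollama > ollama-server.log 2>&1

# systemd install: capture the service journal.
journalctl -u ollama --no-pager > ollama-server.log
```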


@mastoca commented on GitHub (Nov 9, 2024):

Here's the output (slightly cleaned):

2024/11/08 19:39:16 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/mastoca/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-11-08T19:39:16.790-05:00 level=INFO source=images.go:755 msg="total blobs: 225"
time=2024-11-08T19:39:16.793-05:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
time=2024-11-08T19:39:16.795-05:00 level=INFO source=routes.go:1240 msg="Listening on [::]:11434 (version 0.4.1-rc0)"
time=2024-11-08T19:39:16.796-05:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1805014448/runners
time=2024-11-08T19:39:16.838-05:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2]"
time=2024-11-08T19:39:16.838-05:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-08T19:39:16.838-05:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-11-08T19:39:16.839-05:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-11-08T19:39:16.839-05:00 level=WARN source=gpu.go:732 msg="unable to locate gpu dependency libraries"
time=2024-11-08T19:39:16.970-05:00 level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-11-08T19:39:16.970-05:00 level=INFO source=amd_linux.go:296 msg="unsupported Radeon iGPU detected skipping" id=0 total="512.0 MiB"
time=2024-11-08T19:39:16.970-05:00 level=INFO source=amd_linux.go:399 msg="no compatible amdgpu devices detected"
time=2024-11-08T19:39:16.970-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-<redacted> library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4090" total="22.1 GiB" available="20.8 GiB"
[GIN] 2024/11/08 - 19:39:45 | 200 | 144.552µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/11/08 - 19:39:45 | 200 | 23.576778ms | 127.0.0.1 | POST "/api/show"
time=2024-11-08T19:39:45.576-05:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
time=2024-11-08T19:39:45.721-05:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/home/mastoca/.ollama/models/blobs/sha256-7ef0839fb71fbab13fda97c1b9819ffd99c799ba4f93d421ae1e2a46d68c5fa6 gpu=GPU-<redacted> parallel=1 available=22295216128 required="15.3 GiB"
time=2024-11-08T19:39:45.836-05:00 level=INFO source=server.go:105 msg="system memory" total="61.9 GiB" free="39.1 GiB" free_swap="0 B"
time=2024-11-08T19:39:45.837-05:00 level=INFO source=memory.go:343 msg="offload to cuda" projector.weights="1.8 GiB" projector.graph="2.8 GiB" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[20.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="15.3 GiB" memory.required.partial="15.3 GiB" memory.required.kv="656.2 MiB" memory.required.allocations="[15.3 GiB]" memory.weights.total="9.3 GiB" memory.weights.repeating="8.8 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="669.5 MiB"
time=2024-11-08T19:39:45.838-05:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama1805014448/runners/cpu_avx2/ollama_llama_server --model /home/mastoca/.ollama/models/blobs/sha256-7ef0839fb71fbab13fda97c1b9819ffd99c799ba4f93d421ae1e2a46d68c5fa6 --ctx-size 2048 --batch-size 512 --n-gpu-layers 41 --mmproj /home/mastoca/.ollama/models/blobs/sha256-ece5e659647a20a5c28ab9eea1c12a1ad430bc0f2a27021d00ad103b3bf5206f --threads 16 --parallel 1 --port 32953"
time=2024-11-08T19:39:45.838-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-11-08T19:39:45.838-05:00 level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
time=2024-11-08T19:39:45.839-05:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
time=2024-11-08T19:39:45.844-05:00 level=INFO source=runner.go:863 msg="starting go runner"
time=2024-11-08T19:39:45.844-05:00 level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=16
time=2024-11-08T19:39:45.844-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:32953"
llama_model_loader: loaded meta data with 27 key-value pairs and 396 tensors from /home/mastoca/.ollama/models/blobs/sha256-7ef0839fb71fbab13fda97c1b9819ffd99c799ba4f93d421ae1e2a46d68c5fa6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mllama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model
llama_model_loader: - kv 3: general.size_label str = 10B
llama_model_loader: - kv 4: mllama.block_count u32 = 40
llama_model_loader: - kv 5: mllama.context_length u32 = 131072
llama_model_loader: - kv 6: mllama.embedding_length u32 = 4096
llama_model_loader: - kv 7: mllama.feed_forward_length u32 = 14336
llama_model_loader: - kv 8: mllama.attention.head_count u32 = 32
llama_model_loader: - kv 9: mllama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: mllama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 11: mllama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 12: general.file_type u32 = 7
llama_model_loader: - kv 13: mllama.vocab_size u32 = 128256
llama_model_loader: - kv 14: mllama.rope.dimension_count u32 = 128
llama_model_loader: - kv 15: mllama.attention.cross_attention_layers arr[i32,8] = [3, 8, 13, 18, 23, 28, 33, 38]
llama_model_loader: - kv 16: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,128257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,128257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 24: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 25: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 114 tensors
llama_model_loader: - type q8_0: 282 tensors
llm_load_vocab: special tokens cache size = 257
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mllama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 11B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 9.78 B
llm_load_print_meta: model size = 9.67 GiB (8.50 BPW)
llm_load_print_meta: general.name = Model
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llama_model_load: vocab mismatch 128256 !- 128257 ...
llm_load_tensors: ggml ctx size = 0.18 MiB
time=2024-11-08T19:39:46.290-05:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server not responding"
llm_load_tensors: CPU buffer size = 9905.93 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
time=2024-11-08T19:39:46.541-05:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
llama_kv_cache_init: CPU KV buffer size = 656.25 MiB
llama_new_context_with_model: KV self size = 656.25 MiB, K (f16): 328.12 MiB, V (f16): 328.12 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 258.50 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
mllama_model_load: model name: Llama-3.2-11B-Vision-Instruct
mllama_model_load: description: vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment: 32
mllama_model_load: n_tensors: 512
mllama_model_load: n_kv: 17
mllama_model_load: ftype: f16
mllama_model_load:
mllama_model_load: vision using CPU backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2024-11-08T19:39:47.545-05:00 level=INFO source=server.go:601 msg="llama runner started in 1.71 seconds"
[GIN] 2024/11/08 - 19:39:47 | 200 | 1.980357826s | 127.0.0.1 | POST "/api/generate"
time=2024-11-08T19:40:13.747-05:00 level=WARN source=sched.go:137 msg="mllama doesn't support parallel requests yet"
[GIN] 2024/11/08 - 19:41:04 | 200 | 50.550516492s | 127.0.0.1 | POST "/api/chat"


@rick-github commented on GitHub (Nov 9, 2024):

How did you install ollama? It looks like you have missing libraries, which might indicate a packaging problem. I'm able to run 0.4.1-rc0 in docker without a problem, so this is perhaps specific to the package you installed. You could also try installing the actual release, 0.4.1 (https://github.com/ollama/ollama/releases/tag/v0.4.1), and see if it works better.
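
A sketch of trying the released build in Docker with GPU access, assuming the published image tags follow the release version (as with 0.4.0-rc6 earlier in this issue) and that the NVIDIA container toolkit is installed; volume and port flags are illustrative:

```
docker pull ollama/ollama:0.4.1
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:0.4.1

# The startup log should list a CUDA runner and the GPU under "inference compute"
# rather than only cpu/cpu_avx/cpu_avx2.
docker logs ollama
```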


@mastoca commented on GitHub (Nov 9, 2024):

@rick-github I think you're right. I'm using a nix flake (nixos unstable branch) that I'm trying to upgrade to 0.4.1. I suspect it isn't linking the CUDA portion correctly into both binaries (bin/runner and bin/ollama). I'll build the docker image and see how it behaves next.

I'm not a Go developer, so I'm not yet comfortable with the Go toolchain.

Reference: github-starred/ollama#51240