[GH-ISSUE #7439] CUDA error occurs when using x/llama3.2-vision on v0.4.0-rc6 #66784

Closed
opened 2026-05-04 08:11:09 -05:00 by GiteaMirror · 2 comments

Originally created by @baijunty on GitHub (Oct 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7439

What is the issue?

Laptop with an NVIDIA GeForce RTX 4060 Laptop GPU.
[_ollama_logs.txt](https://github.com/user-attachments/files/17581556/_ollama_logs.txt)

OS

Linux, Docker

GPU

Nvidia

CPU

AMD

Ollama version

v0.4.0-rc6

GiteaMirror added the bug label 2026-05-04 08:11:09 -05:00

@thatjpk commented on GitHub (Oct 31, 2024):

Actually came to report a similar issue. My log is similar but not the same, so I filed a separate issue in https://github.com/ollama/ollama/issues/7440, but posting here in case they end up being related.

@baijunty's log posted above:

```
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.36 MiB
llm_load_tensors: offloading 17 repeating layers to GPU
llm_load_tensors: offloaded 17/41 layers to GPU
llm_load_tensors:        CPU buffer size =  5679.33 MiB
llm_load_tensors:      CUDA0 buffer size =  2126.70 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   352.12 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   304.12 MiB
llama_new_context_with_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 251
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: vision using CUDA backend
mllama_model_load: compute allocated memory: 2853.34 MB
time=2024-10-31T03:29:56.629Z level=INFO source=server.go:606 msg="llama runner started in 6.78 seconds"
CUDA error: the resource allocation failed
  current device: 0, in function cublas_handle at ggml-cuda/common.cuh:677
  cublasCreate_v2(&cublas_handles[device])
ggml-cuda.cu:132: CUDA error
SIGSEGV: segmentation violation
PC=0x75b5a9224c47 m=8 sigcode=1 addr=0x206603fd8
signal arrived during cgo execution
```
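
The fatal call in that log is `cublasCreate_v2`, which allocates cuBLAS resources on the device, so a "resource allocation failed" status right after the vision encoder reserved a 2853.34 MB compute buffer most likely means the GPU had no VRAM left at that point. A small standalone probe along these lines (a sketch, not Ollama or ggml code; the file name and build command are assumptions) reports free memory and attempts the same handle creation, which can help confirm that reading:

```
// Hedged sketch: a standalone VRAM/cuBLAS probe, not part of Ollama or ggml.
// Assumed build command: nvcc cublas_probe.cu -lcublas -o cublas_probe
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    std::printf("VRAM free: %.1f MiB of %.1f MiB\n",
                free_bytes / 1048576.0, total_bytes / 1048576.0);

    // cublasCreate allocates device-side resources; on a nearly full GPU it
    // can fail (typically CUBLAS_STATUS_ALLOC_FAILED), and this is the same
    // call the runner aborts on in the log above.
    cublasHandle_t handle;
    cublasStatus_t status = cublasCreate(&handle);
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "cublasCreate failed with status %d\n", (int)status);
        return 1;
    }
    std::printf("cublasCreate succeeded\n");
    cublasDestroy(handle);
    return 0;
}
```

If the probe, run while the model is loaded, shows only a sliver of free VRAM, the crash is ordinary memory exhaustion rather than a driver fault.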

My log:

```
  Device 0: NVIDIA GeForce RTX 3080 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.36 MiB
llm_load_tensors: offloading 31 repeating layers to GPU
llm_load_tensors: offloaded 31/41 layers to GPU
llm_load_tensors:        CPU buffer size =  5679.33 MiB
llm_load_tensors:      CUDA0 buffer size =  3841.45 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   156.06 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   500.19 MiB
llama_new_context_with_model: KV self size  =  656.25 MiB, K (f16):  328.12 MiB, V (f16):  328.12 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   669.48 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 95
mllama_model_load: model name:   Llama-3.2-11B-Vision-Instruct
mllama_model_load: description:  vision encoder for Mllama
mllama_model_load: GGUF version: 3
mllama_model_load: alignment:    32
mllama_model_load: n_tensors:    512
mllama_model_load: n_kv:         17
mllama_model_load: ftype:        f16
mllama_model_load: 
mllama_model_load: vision using CUDA backend
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2853.34 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 2991947904
mllama_model_load: compute allocated memory: 0.00 MB
time=2024-10-31T05:39:41.603Z level=INFO source=server.go:606 msg="llama runner started in 2.26 seconds"
SIGSEGV: segmentation violation
PC=0x634314838794 m=7 sigcode=1 addr=0x10
signal arrived during cgo execution
```
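
In this second log the sequence is a bit clearer: the 2853.34 MiB compute buffer for the vision encoder fails outright ("cudaMalloc failed: out of memory"), the loader still reports "compute allocated memory: 0.00 MB", and the process then segfaults at `addr=0x10`, which looks like a read through a null buffer pointer a few bytes past address zero. A minimal sketch of that failure pattern (hypothetical types and names, not the actual ggml/mllama code) shows how an unchecked allocation failure can surface later as exactly this kind of SIGSEGV:

```
// Hedged sketch of the suspected crash pattern, with made-up types. It is not
// the real code path, only an illustration of why a failed allocation can
// show up later as SIGSEGV at a small address such as 0x10.
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for a backend buffer; NOT the actual ggml types.
struct compute_buffer {
    void   *data;   // offset 0x00 on a 64-bit build
    size_t  size;   // offset 0x08
    size_t  used;   // offset 0x10: the same small address as the fault
};

compute_buffer *alloc_compute_buffer(size_t bytes) {
    void *dev = std::malloc(bytes);      // imagine cudaMalloc here
    if (dev == nullptr) {
        return nullptr;                  // out of memory: caller gets null
    }
    return new compute_buffer{dev, bytes, 0};
}

int main() {
    // Force the out-of-memory path by asking for an impossible amount, then
    // keep going without checking the result, much like the runner appears to
    // have done after "compute allocated memory: 0.00 MB".
    compute_buffer *buf = alloc_compute_buffer(static_cast<size_t>(-1));
    std::printf("used = %zu\n", buf->used);  // buf is null: read near 0x10, SIGSEGV
    return 0;
}
```

If that reading is right, the allocation failure is the real problem and the segfault is just the unchecked aftermath, which would fit both traces ending with "signal arrived during cgo execution".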

@jessegross commented on GitHub (Oct 31, 2024):

Thanks for the report - I think these are the same issue, so tracking this in #7440
