[GH-ISSUE #7673] CUDA error: out of memory - Llama 3.2 3B on laptop with 13 GB RAM #66953

Closed
opened 2026-05-04 08:59:55 -05:00 by GiteaMirror · 26 comments
Owner

Originally created by @kripper on GitHub (Nov 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7673

What is the issue?

The hardware has 11.1 GiB of RAM plus 1.9 GiB of GPU VRAM (13 GiB in total), yet it fails to run a 3B model.
Any idea why?

```
Nov 14 17:49:49 fedora ollama[1197]: r14    0x6
Nov 14 17:49:49 fedora ollama[1197]: r15    0x626b00000
Nov 14 17:49:49 fedora ollama[1197]: rip    0x7fd1485c4664
Nov 14 17:49:49 fedora ollama[1197]: rflags 0x246
Nov 14 17:49:49 fedora ollama[1197]: cs     0x33
Nov 14 17:49:49 fedora ollama[1197]: fs     0x0
Nov 14 17:49:49 fedora ollama[1197]: gs     0x0
Nov 14 17:49:49 fedora ollama[1197]: [GIN] 2024/11/14 - 17:49:49 | 200 |         1m56s |     192.168.0.7 | POST     "/api/chat"
Nov 14 17:52:06 fedora ollama[1197]: [GIN] 2024/11/14 - 17:52:06 | 200 |      74.935µs |     192.168.0.7 | GET      "/api/version"
Nov 14 17:52:13 fedora ollama[1197]: [GIN] 2024/11/14 - 17:52:13 | 200 |      41.147µs |     192.168.0.7 | GET      "/api/version"
Nov 14 17:52:33 fedora ollama[1197]: [GIN] 2024/11/14 - 17:52:33 | 200 |    1.082721ms |     192.168.0.7 | GET      "/api/tags"
Nov 14 17:52:34 fedora ollama[1197]: time=2024-11-14T17:52:34.562-05:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-347193f9-2627-a9eb-8c2e-e2158c820e98 library=cuda total="1.9 GiB" available="94.7 MiB"
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.610-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.047152054 model=/usr/share/ollama/.ollama/models/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.797-05:00 level=INFO source=server.go:105 msg="system memory" total="11.1 GiB" free="9.9 GiB" free_swap="8.0 GiB"
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.798-05:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=13 layers.split="" memory.available="[1.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.2 GiB" memory.required.partial="1.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.9 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.799-05:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama1543119167/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --n-gpu-layers 13 --threads 2 --parallel 1 --port 42457"
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.800-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.800-05:00 level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.800-05:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.811-05:00 level=INFO source=runner.go:863 msg="starting go runner"
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.811-05:00 level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=2
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.811-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:42457"
Nov 14 17:52:39 fedora ollama[1197]: time=2024-11-14T17:52:39.861-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.297443815 model=/usr/share/ollama/.ollama/models/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   1:                               general.type str              = model
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   5:                         general.size_label str              = 3B
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   8:                          llama.block_count u32              = 28
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  18:                          general.file_type u32              = 15
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Nov 14 17:52:39 fedora ollama[1197]: llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Nov 14 17:52:40 fedora ollama[1197]: time=2024-11-14T17:52:40.051-05:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
Nov 14 17:52:40 fedora ollama[1197]: llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Nov 14 17:52:40 fedora ollama[1197]: llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
Nov 14 17:52:40 fedora ollama[1197]: llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
Nov 14 17:52:40 fedora ollama[1197]: llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Nov 14 17:52:40 fedora ollama[1197]: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
Nov 14 17:52:40 fedora ollama[1197]: llama_model_loader: - type  f32:   58 tensors
Nov 14 17:52:40 fedora ollama[1197]: llama_model_loader: - type q4_K:  168 tensors
Nov 14 17:52:40 fedora ollama[1197]: llama_model_loader: - type q6_K:   29 tensors
Nov 14 17:52:40 fedora ollama[1197]: time=2024-11-14T17:52:40.110-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.547122288 model=/usr/share/ollama/.ollama/models/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730
Nov 14 17:52:40 fedora ollama[1197]: llm_load_vocab: special tokens cache size = 256
Nov 14 17:52:40 fedora ollama[1197]: llm_load_vocab: token to piece cache size = 0.7999 MB
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: format           = GGUF V3 (latest)
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: arch             = llama
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: vocab type       = BPE
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_vocab          = 128256
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_merges         = 280147
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: vocab_only       = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_ctx_train      = 131072
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_embd           = 3072
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_layer          = 28
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_head           = 24
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_head_kv        = 8
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_rot            = 128
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_swa            = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_embd_head_k    = 128
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_embd_head_v    = 128
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_gqa            = 3
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_embd_k_gqa     = 1024
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_embd_v_gqa     = 1024
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_ff             = 8192
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_expert         = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_expert_used    = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: causal attn      = 1
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: pooling type     = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: rope type        = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: rope scaling     = linear
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: freq_base_train  = 500000.0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: freq_scale_train = 1
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: rope_finetuned   = unknown
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: ssm_d_conv       = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: ssm_d_inner      = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: ssm_d_state      = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: ssm_dt_rank      = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: model type       = 3B
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: model ftype      = Q4_K - Medium
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: model params     = 3.21 B
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: LF token         = 128 'Ä'
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
Nov 14 17:52:40 fedora ollama[1197]: llm_load_print_meta: max token length = 256
Nov 14 17:52:40 fedora ollama[1197]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Nov 14 17:52:40 fedora ollama[1197]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Nov 14 17:52:40 fedora ollama[1197]: ggml_cuda_init: found 1 CUDA devices:
Nov 14 17:52:40 fedora ollama[1197]:   Device 0: NVIDIA GeForce 940M, compute capability 5.0, VMM: yes
Nov 14 17:52:40 fedora ollama[1197]: llm_load_tensors: ggml ctx size =    0.24 MiB
Nov 14 17:53:05 fedora ollama[1197]: llm_load_tensors: offloading 13 repeating layers to GPU
Nov 14 17:53:05 fedora ollama[1197]: llm_load_tensors: offloaded 13/29 layers to GPU
Nov 14 17:53:05 fedora ollama[1197]: llm_load_tensors:        CPU buffer size =  1918.35 MiB
Nov 14 17:53:05 fedora ollama[1197]: llm_load_tensors:      CUDA0 buffer size =   757.22 MiB
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model: n_ctx      = 2048
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model: n_batch    = 512
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model: n_ubatch   = 512
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model: flash_attn = 0
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model: freq_base  = 500000.0
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model: freq_scale = 1
Nov 14 17:53:06 fedora ollama[1197]: llama_kv_cache_init:  CUDA_Host KV buffer size =   120.00 MiB
Nov 14 17:53:06 fedora ollama[1197]: llama_kv_cache_init:      CUDA0 KV buffer size =   104.00 MiB
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model:      CUDA0 compute buffer size =   564.73 MiB
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model: graph nodes  = 902
Nov 14 17:53:06 fedora ollama[1197]: llama_new_context_with_model: graph splits = 199
Nov 14 17:53:06 fedora ollama[1197]: time=2024-11-14T17:53:06.409-05:00 level=INFO source=server.go:601 msg="llama runner started in 26.61 seconds"
Nov 14 17:53:16 fedora ollama[1197]: CUDA error: out of memory
Nov 14 17:53:16 fedora ollama[1197]:   current device: 0, in function alloc at ggml-cuda.cu:406
Nov 14 17:53:16 fedora ollama[1197]:   cuMemCreate(&handle, reserve_size, &prop, 0)
Nov 14 17:53:16 fedora ollama[1197]: ggml-cuda.cu:132: CUDA error
Nov 14 17:53:16 fedora ollama[2556]: [New LWP 2509]
Nov 14 17:53:16 fedora ollama[2556]: [New LWP 2508]
Nov 14 17:53:16 fedora ollama[2556]: [New LWP 2507]
Nov 14 17:53:16 fedora ollama[2556]: [New LWP 2506]
Nov 14 17:53:16 fedora ollama[2556]: [New LWP 2503]
Nov 14 17:53:16 fedora ollama[2556]: [New LWP 2502]
Nov 14 17:53:16 fedora ollama[2556]: [New LWP 2501]
Nov 14 17:53:16 fedora ollama[2556]: [New LWP 2500]
Nov 14 17:53:16 fedora ollama[2556]: [New LWP 2499]
Nov 14 17:53:16 fedora ollama[2556]: [Thread debugging using libthread_db enabled]
Nov 14 17:53:16 fedora ollama[2556]: Using host libthread_db library "/lib64/libthread_db.so.1".
Nov 14 17:53:16 fedora ollama[2556]: 0x00005595604abba3 in ?? ()
Nov 14 17:53:16 fedora ollama[2556]: #0  0x00005595604abba3 in ?? ()
Nov 14 17:53:16 fedora ollama[2556]: #1  0x0000559560470ef0 in _start ()
Nov 14 17:53:16 fedora ollama[2556]: [Inferior 1 (process 2498) detached]
Nov 14 17:53:16 fedora ollama[1197]: SIGABRT: abort
Nov 14 17:53:16 fedora ollama[1197]: PC=0x7fc2f38a8664 m=4 sigcode=18446744073709551610
Nov 14 17:53:16 fedora ollama[1197]: signal arrived during cgo execution
Nov 14 17:53:16 fedora ollama[1197]: goroutine 7 gp=0xc000156000 m=4 mp=0xc000049808 [syscall]:
Nov 14 17:53:16 fedora ollama[1197]: runtime.cgocall(0x5595606bee90, 0xc000052b60)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/cgocall.go:157 +0x4b fp=0xc000052b38 sp=0xc000052b00 pc=0x5595604413cb
Nov 14 17:53:16 fedora ollama[1197]: github.com/ollama/ollama/llama._Cfunc_llama_decode(0x7fc27c0068f0, {0x200, 0x7fc27c028e80, 0x0, 0x0, 0x7fc27c029690, 0x7fc27c029ea0, 0x7fc27c02a6b0, 0x7fc2559b1600, 0x0, ...})
Nov 14 17:53:16 fedora ollama[1197]:         _cgo_gotypes.go:543 +0x52 fp=0xc000052b60 sp=0xc000052b38 pc=0x55956053e952
Nov 14 17:53:16 fedora ollama[1197]: github.com/ollama/ollama/llama.(*Context).Decode.func1(0x5595606bad4b?, 0x7fc27c0068f0?)
Nov 14 17:53:16 fedora ollama[1197]:         github.com/ollama/ollama/llama/llama.go:167 +0xd8 fp=0xc000052c80 sp=0xc000052b60 pc=0x559560540e78
Nov 14 17:53:16 fedora ollama[1197]: github.com/ollama/ollama/llama.(*Context).Decode(0x559560cb3060?, 0x0?)
Nov 14 17:53:16 fedora ollama[1197]:         github.com/ollama/ollama/llama/llama.go:167 +0x17 fp=0xc000052cc8 sp=0xc000052c80 pc=0x559560540cd7
Nov 14 17:53:16 fedora ollama[1197]: main.(*Server).processBatch(0xc000122120, 0xc0000ce000, 0xc000052f10)
Nov 14 17:53:16 fedora ollama[1197]:         github.com/ollama/ollama/llama/runner/runner.go:424 +0x29e fp=0xc000052ed0 sp=0xc000052cc8 pc=0x5595606b9d7e
Nov 14 17:53:16 fedora ollama[1197]: main.(*Server).run(0xc000122120, {0x5595609fca40, 0xc000078050})
Nov 14 17:53:16 fedora ollama[1197]:         github.com/ollama/ollama/llama/runner/runner.go:338 +0x1a5 fp=0xc000052fb8 sp=0xc000052ed0 pc=0x5595606b9765
Nov 14 17:53:16 fedora ollama[1197]: main.main.gowrap2()
Nov 14 17:53:16 fedora ollama[1197]:         github.com/ollama/ollama/llama/runner/runner.go:901 +0x28 fp=0xc000052fe0 sp=0xc000052fb8 pc=0x5595606bdec8
Nov 14 17:53:16 fedora ollama[1197]: runtime.goexit({})
Nov 14 17:53:16 fedora ollama[1197]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc000052fe8 sp=0xc000052fe0 pc=0x5595604a9de1
Nov 14 17:53:16 fedora ollama[1197]: created by main.main in goroutine 1
Nov 14 17:53:16 fedora ollama[1197]:         github.com/ollama/ollama/llama/runner/runner.go:901 +0xc2b
Nov 14 17:53:16 fedora ollama[1197]: goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
Nov 14 17:53:16 fedora ollama[1197]: runtime.gopark(0x1?, 0xc000029908?, 0xf4?, 0x7d?, 0xc0000298e8?)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:402 +0xce fp=0xc000029888 sp=0xc000029868 pc=0x55956047800e
Nov 14 17:53:16 fedora ollama[1197]: runtime.netpollblock(0x10?, 0x60440b26?, 0x95?)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/netpoll.go:573 +0xf7 fp=0xc0000298c0 sp=0xc000029888 pc=0x559560470257
Nov 14 17:53:16 fedora ollama[1197]: internal/poll.runtime_pollWait(0x7fc2f36d6fe0, 0x72)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/netpoll.go:345 +0x85 fp=0xc0000298e0 sp=0xc0000298c0 pc=0x5595604a4aa5
Nov 14 17:53:16 fedora ollama[1197]: internal/poll.(*pollDesc).wait(0x3?, 0x7fc2f55512c8?, 0x0)
Nov 14 17:53:16 fedora ollama[1197]:         internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000029908 sp=0xc0000298e0 pc=0x5595604f49c7
Nov 14 17:53:16 fedora ollama[1197]: internal/poll.(*pollDesc).waitRead(...)
Nov 14 17:53:16 fedora ollama[1197]:         internal/poll/fd_poll_runtime.go:89
Nov 14 17:53:16 fedora ollama[1197]: internal/poll.(*FD).Accept(0xc000150080)
Nov 14 17:53:16 fedora ollama[1197]:         internal/poll/fd_unix.go:611 +0x2ac fp=0xc0000299b0 sp=0xc000029908 pc=0x5595604f5e8c
Nov 14 17:53:16 fedora ollama[1197]: net.(*netFD).accept(0xc000150080)
Nov 14 17:53:16 fedora ollama[1197]:         net/fd_unix.go:172 +0x29 fp=0xc000029a68 sp=0xc0000299b0 pc=0x5595605648a9
Nov 14 17:53:16 fedora ollama[1197]: net.(*TCPListener).accept(0xc00002e1e0)
Nov 14 17:53:16 fedora ollama[1197]:         net/tcpsock_posix.go:159 +0x1e fp=0xc000029a90 sp=0xc000029a68 pc=0x5595605755de
Nov 14 17:53:16 fedora ollama[1197]: net.(*TCPListener).Accept(0xc00002e1e0)
Nov 14 17:53:16 fedora ollama[1197]:         net/tcpsock.go:327 +0x30 fp=0xc000029ac0 sp=0xc000029a90 pc=0x559560574930
Nov 14 17:53:16 fedora ollama[1197]: net/http.(*onceCloseListener).Accept(0xc00009e000?)
Nov 14 17:53:16 fedora ollama[1197]:         <autogenerated>:1 +0x24 fp=0xc000029ad8 sp=0xc000029ac0 pc=0x55956069ba44
Nov 14 17:53:16 fedora ollama[1197]: net/http.(*Server).Serve(0xc0000163c0, {0x5595609fc400, 0xc00002e1e0})
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:3260 +0x33e fp=0xc000029c08 sp=0xc000029ad8 pc=0x55956069285e
Nov 14 17:53:16 fedora ollama[1197]: main.main()
Nov 14 17:53:16 fedora ollama[1197]:         github.com/ollama/ollama/llama/runner/runner.go:921 +0xfcc fp=0xc000029f50 sp=0xc000029c08 pc=0x5595606bdc4c
Nov 14 17:53:16 fedora ollama[1197]: runtime.main()
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:271 +0x29d fp=0xc000029fe0 sp=0xc000029f50 pc=0x559560477bdd
Nov 14 17:53:16 fedora ollama[1197]: runtime.goexit({})
Nov 14 17:53:16 fedora ollama[1197]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc000029fe8 sp=0xc000029fe0 pc=0x5595604a9de1
Nov 14 17:53:16 fedora ollama[1197]: goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
Nov 14 17:53:16 fedora ollama[1197]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:402 +0xce fp=0xc000042fa8 sp=0xc000042f88 pc=0x55956047800e
Nov 14 17:53:16 fedora ollama[1197]: runtime.goparkunlock(...)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:408
Nov 14 17:53:16 fedora ollama[1197]: runtime.forcegchelper()
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:326 +0xb8 fp=0xc000042fe0 sp=0xc000042fa8 pc=0x559560477e98
Nov 14 17:53:16 fedora ollama[1197]: runtime.goexit({})
Nov 14 17:53:16 fedora ollama[1197]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc000042fe8 sp=0xc000042fe0 pc=0x5595604a9de1
Nov 14 17:53:16 fedora ollama[1197]: created by runtime.init.6 in goroutine 1
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:314 +0x1a
Nov 14 17:53:16 fedora ollama[1197]: goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
Nov 14 17:53:16 fedora ollama[1197]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:402 +0xce fp=0xc000043780 sp=0xc000043760 pc=0x55956047800e
Nov 14 17:53:16 fedora ollama[1197]: runtime.goparkunlock(...)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:408
Nov 14 17:53:16 fedora ollama[1197]: runtime.bgsweep(0xc00006a000)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/mgcsweep.go:278 +0x94 fp=0xc0000437c8 sp=0xc000043780 pc=0x559560462b54
Nov 14 17:53:16 fedora ollama[1197]: runtime.gcenable.gowrap1()
Nov 14 17:53:16 fedora ollama[1197]:         runtime/mgc.go:203 +0x25 fp=0xc0000437e0 sp=0xc0000437c8 pc=0x559560457685
Nov 14 17:53:16 fedora ollama[1197]: runtime.goexit({})
Nov 14 17:53:16 fedora ollama[1197]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc0000437e8 sp=0xc0000437e0 pc=0x5595604a9de1
Nov 14 17:53:16 fedora ollama[1197]: created by runtime.gcenable in goroutine 1
Nov 14 17:53:16 fedora ollama[1197]:         runtime/mgc.go:203 +0x66
Nov 14 17:53:16 fedora ollama[1197]: goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
Nov 14 17:53:16 fedora ollama[1197]: runtime.gopark(0xc00006a000?, 0x5595608fce98?, 0x1?, 0x0?, 0xc000007340?)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:402 +0xce fp=0xc000043f78 sp=0xc000043f58 pc=0x55956047800e
Nov 14 17:53:16 fedora ollama[1197]: runtime.goparkunlock(...)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:408
Nov 14 17:53:16 fedora ollama[1197]: runtime.(*scavengerState).park(0x559560bca4c0)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/mgcscavenge.go:425 +0x49 fp=0xc000043fa8 sp=0xc000043f78 pc=0x559560460549
Nov 14 17:53:16 fedora ollama[1197]: runtime.bgscavenge(0xc00006a000)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/mgcscavenge.go:653 +0x3c fp=0xc000043fc8 sp=0xc000043fa8 pc=0x559560460adc
Nov 14 17:53:16 fedora ollama[1197]: runtime.gcenable.gowrap2()
Nov 14 17:53:16 fedora ollama[1197]:         runtime/mgc.go:204 +0x25 fp=0xc000043fe0 sp=0xc000043fc8 pc=0x559560457625
Nov 14 17:53:16 fedora ollama[1197]: runtime.goexit({})
Nov 14 17:53:16 fedora ollama[1197]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc000043fe8 sp=0xc000043fe0 pc=0x5595604a9de1
Nov 14 17:53:16 fedora ollama[1197]: created by runtime.gcenable in goroutine 1
Nov 14 17:53:16 fedora ollama[1197]:         runtime/mgc.go:204 +0xa5
Nov 14 17:53:16 fedora ollama[1197]: goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
Nov 14 17:53:16 fedora ollama[1197]: runtime.gopark(0xc000042648?, 0x55956044af85?, 0xa8?, 0x1?, 0xc0000061c0?)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:402 +0xce fp=0xc000042620 sp=0xc000042600 pc=0x55956047800e
Nov 14 17:53:16 fedora ollama[1197]: runtime.runfinq()
Nov 14 17:53:16 fedora ollama[1197]:         runtime/mfinal.go:194 +0x107 fp=0xc0000427e0 sp=0xc000042620 pc=0x5595604566c7
Nov 14 17:53:16 fedora ollama[1197]: runtime.goexit({})
Nov 14 17:53:16 fedora ollama[1197]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc0000427e8 sp=0xc0000427e0 pc=0x5595604a9de1
Nov 14 17:53:16 fedora ollama[1197]: created by runtime.createfing in goroutine 1
Nov 14 17:53:16 fedora ollama[1197]:         runtime/mfinal.go:164 +0x3d
Nov 14 17:53:16 fedora ollama[1197]: goroutine 108 gp=0xc000007dc0 m=nil [IO wait]:
Nov 14 17:53:16 fedora ollama[1197]: runtime.gopark(0x10?, 0x10?, 0xf0?, 0x4d?, 0xb?)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:402 +0xce fp=0xc000044da8 sp=0xc000044d88 pc=0x55956047800e
Nov 14 17:53:16 fedora ollama[1197]: runtime.netpollblock(0x5595604de558?, 0x60440b26?, 0x95?)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/netpoll.go:573 +0xf7 fp=0xc000044de0 sp=0xc000044da8 pc=0x559560470257
Nov 14 17:53:16 fedora ollama[1197]: internal/poll.runtime_pollWait(0x7fc2f36d6ee8, 0x72)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/netpoll.go:345 +0x85 fp=0xc000044e00 sp=0xc000044de0 pc=0x5595604a4aa5
Nov 14 17:53:16 fedora ollama[1197]: internal/poll.(*pollDesc).wait(0xc00009c000?, 0xc000092101?, 0x0)
Nov 14 17:53:16 fedora ollama[1197]:         internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000044e28 sp=0xc000044e00 pc=0x5595604f49c7
Nov 14 17:53:16 fedora ollama[1197]: internal/poll.(*pollDesc).waitRead(...)
Nov 14 17:53:16 fedora ollama[1197]:         internal/poll/fd_poll_runtime.go:89
Nov 14 17:53:16 fedora ollama[1197]: internal/poll.(*FD).Read(0xc00009c000, {0xc000092101, 0x1, 0x1})
Nov 14 17:53:16 fedora ollama[1197]:         internal/poll/fd_unix.go:164 +0x27a fp=0xc000044ec0 sp=0xc000044e28 pc=0x5595604f551a
Nov 14 17:53:16 fedora ollama[1197]: net.(*netFD).Read(0xc00009c000, {0xc000092101?, 0xc000044f48?, 0x5595604a66d0?})
Nov 14 17:53:16 fedora ollama[1197]:         net/fd_posix.go:55 +0x25 fp=0xc000044f08 sp=0xc000044ec0 pc=0x5595605637a5
Nov 14 17:53:16 fedora ollama[1197]: net.(*conn).Read(0xc000094008, {0xc000092101?, 0x0?, 0x559560cb3060?})
Nov 14 17:53:16 fedora ollama[1197]:         net/net.go:185 +0x45 fp=0xc000044f50 sp=0xc000044f08 pc=0x55956056da65
Nov 14 17:53:16 fedora ollama[1197]: net.(*TCPConn).Read(0xc0000920f0?, {0xc000092101?, 0x0?, 0x0?})
Nov 14 17:53:16 fedora ollama[1197]:         <autogenerated>:1 +0x25 fp=0xc000044f80 sp=0xc000044f50 pc=0x559560579445
Nov 14 17:53:16 fedora ollama[1197]: net/http.(*connReader).backgroundRead(0xc0000920f0)
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:681 +0x37 fp=0xc000044fc8 sp=0xc000044f80 pc=0x5595606881d7
Nov 14 17:53:16 fedora ollama[1197]: net/http.(*connReader).startBackgroundRead.gowrap2()
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:677 +0x25 fp=0xc000044fe0 sp=0xc000044fc8 pc=0x559560688105
Nov 14 17:53:16 fedora ollama[1197]: runtime.goexit({})
Nov 14 17:53:16 fedora ollama[1197]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc000044fe8 sp=0xc000044fe0 pc=0x5595604a9de1
Nov 14 17:53:16 fedora ollama[1197]: created by net/http.(*connReader).startBackgroundRead in goroutine 18
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:677 +0xba
Nov 14 17:53:16 fedora ollama[1197]: goroutine 18 gp=0xc000082380 m=nil [select]:
Nov 14 17:53:16 fedora ollama[1197]: runtime.gopark(0xc00029fa80?, 0x2?, 0x60?, 0x0?, 0xc00029f824?)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/proc.go:402 +0xce fp=0xc00029f698 sp=0xc00029f678 pc=0x55956047800e
Nov 14 17:53:16 fedora ollama[1197]: runtime.selectgo(0xc00029fa80, 0xc00029f820, 0x59a?, 0x0, 0x1?, 0x1)
Nov 14 17:53:16 fedora ollama[1197]:         runtime/select.go:327 +0x725 fp=0xc00029f7b8 sp=0xc00029f698 pc=0x5595604893e5
Nov 14 17:53:16 fedora ollama[1197]: main.(*Server).completion(0xc000122120, {0x5595609fc5b0, 0xc000280460}, 0xc00017ab40)
Nov 14 17:53:16 fedora ollama[1197]:         github.com/ollama/ollama/llama/runner/runner.go:652 +0x8fe fp=0xc00029fab8 sp=0xc00029f7b8 pc=0x5595606bb6de
Nov 14 17:53:16 fedora ollama[1197]: main.(*Server).completion-fm({0x5595609fc5b0?, 0xc000280460?}, 0x559560696b8d?)
Nov 14 17:53:16 fedora ollama[1197]:         <autogenerated>:1 +0x36 fp=0xc00029fae8 sp=0xc00029fab8 pc=0x5595606be6b6
Nov 14 17:53:16 fedora ollama[1197]: net/http.HandlerFunc.ServeHTTP(0xc00007ed00?, {0x5595609fc5b0?, 0xc000280460?}, 0x10?)
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:2171 +0x29 fp=0xc00029fb10 sp=0xc00029fae8 pc=0x55956068f629
Nov 14 17:53:16 fedora ollama[1197]: net/http.(*ServeMux).ServeHTTP(0x55956044af85?, {0x5595609fc5b0, 0xc000280460}, 0xc00017ab40)
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:2688 +0x1ad fp=0xc00029fb60 sp=0xc00029fb10 pc=0x5595606914ad
Nov 14 17:53:16 fedora ollama[1197]: net/http.serverHandler.ServeHTTP({0x5595609fb900?}, {0x5595609fc5b0?, 0xc000280460?}, 0x6?)
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:3142 +0x8e fp=0xc00029fb90 sp=0xc00029fb60 pc=0x5595606924ce
Nov 14 17:53:16 fedora ollama[1197]: net/http.(*conn).serve(0xc00009e000, {0x5595609fca08, 0xc00007cdb0})
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:2044 +0x5e8 fp=0xc00029ffb8 sp=0xc00029fb90 pc=0x55956068e268
Nov 14 17:53:16 fedora ollama[1197]: net/http.(*Server).Serve.gowrap3()
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:3290 +0x28 fp=0xc00029ffe0 sp=0xc00029ffb8 pc=0x559560692c48
Nov 14 17:53:16 fedora ollama[1197]: runtime.goexit({})
Nov 14 17:53:16 fedora ollama[1197]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc00029ffe8 sp=0xc00029ffe0 pc=0x5595604a9de1
Nov 14 17:53:16 fedora ollama[1197]: created by net/http.(*Server).Serve in goroutine 1
Nov 14 17:53:16 fedora ollama[1197]:         net/http/server.go:3290 +0x4b4
Nov 14 17:53:16 fedora ollama[1197]: rax    0x0
Nov 14 17:53:16 fedora ollama[1197]: rbx    0x9c5
Nov 14 17:53:16 fedora ollama[1197]: rcx    0x7fc2f38a8664
Nov 14 17:53:16 fedora ollama[1197]: rdx    0x6
Nov 14 17:53:16 fedora ollama[1197]: rdi    0x9c2
Nov 14 17:53:16 fedora ollama[1197]: rsi    0x9c5
Nov 14 17:53:16 fedora ollama[1197]: rbp    0x7fc2933f6410
Nov 14 17:53:16 fedora ollama[1197]: rsp    0x7fc2933f63d0
Nov 14 17:53:16 fedora ollama[1197]: r8     0x0
Nov 14 17:53:16 fedora ollama[1197]: r9     0xfffffffc
Nov 14 17:53:16 fedora ollama[1197]: r10    0x8
Nov 14 17:53:16 fedora ollama[1197]: r11    0x246
Nov 14 17:53:16 fedora ollama[1197]: r12    0x7fc293400000
Nov 14 17:53:16 fedora ollama[1197]: r13    0x84
Nov 14 17:53:16 fedora ollama[1197]: r14    0x6
Nov 14 17:53:16 fedora ollama[1197]: r15    0x637f60000
Nov 14 17:53:16 fedora ollama[1197]: rip    0x7fc2f38a8664
Nov 14 17:53:16 fedora ollama[1197]: rflags 0x246
Nov 14 17:53:16 fedora ollama[1197]: cs     0x33
Nov 14 17:53:16 fedora ollama[1197]: fs     0x0
Nov 14 17:53:16 fedora ollama[1197]: gs     0x0
Nov 14 17:53:16 fedora ollama[1197]: [GIN] 2024/11/14 - 17:53:16 | 200 |  42.62644068s |     192.168.0.7 | POST     "/api/chat"
Nov 14 17:53:17 fedora ollama[1197]: [GIN] 2024/11/14 - 17:53:17 | 200 |     752.265µs |     192.168.0.7 | GET      "/api/tags"
Nov 14 17:54:25 fedora ollama[1197]: [GIN] 2024/11/14 - 17:54:25 | 200 |     819.913µs |     192.168.0.7 | GET      "/api/tags"
```
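
For what it's worth, a rough tally of the CUDA-side figures printed above suggests the limit being hit is the 940M's ~1.9 GiB of VRAM, not the combined 13 GiB: the 13 offloaded layers, the GPU slice of the KV cache, and the compute buffer all have to live on the card (which is also driving the display), and the scheduler briefly saw only 94.7 MiB free because the previous runner's VRAM had not been released ("gpu VRAM usage didn't recover within timeout"). A minimal sketch, using only numbers copied from the log; the grouping and variable names are mine:

```python
# Rough tally (MiB) of the CUDA0 allocations reported in the log above.
# Grouping and variable names are mine; the numbers are copied verbatim.
cuda0_weights = 757.22   # llm_load_tensors: CUDA0 buffer size
cuda0_kv      = 104.00   # llama_kv_cache_init: CUDA0 KV buffer size
cuda0_compute = 564.73   # llama_new_context_with_model: CUDA0 compute buffer size

requested = cuda0_weights + cuda0_kv + cuda0_compute
vram_total = 1.9 * 1024      # "total=1.9 GiB" reported for the GeForce 940M
vram_seen_free = 94.7        # "available=94.7 MiB" while the previous model was still loaded

print(f"CUDA0 buffers requested: {requested:.0f} MiB")      # ~1426 MiB
print(f"940M VRAM total:         {vram_total:.0f} MiB")     # ~1946 MiB
print(f"VRAM free at scheduling: {vram_seen_free:.1f} MiB")
```

That comes to roughly 1.4 GiB of explicitly requested CUDA0 buffers on a card the estimator already budgeted at its full capacity ("memory.required.partial=1.9 GiB"), so any additional usage during decode, such as the cuMemCreate pool growth that fails at ggml-cuda.cu:406, plausibly tips it over.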

OS

No response

GPU

No response

CPU

No response

Ollama version

No response
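
One way to test that hypothesis is to re-issue the same request with GPU offload reduced or disabled. This is a sketch only, assuming the standard Ollama /api/chat endpoint and its documented num_gpu/num_ctx request options; the model tag, prompt, and host are placeholders:

```python
import json
import urllib.request

# Sketch: re-issue the failing chat request with GPU offload disabled (num_gpu=0),
# or reduced below the 13 layers chosen automatically above, to see whether the
# CUDA OOM goes away. num_gpu/num_ctx are documented Ollama request options;
# the model tag, prompt, and host are placeholders.
payload = {
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "hello"}],
    "stream": False,
    "options": {
        "num_gpu": 0,     # 0 = run fully on CPU; try e.g. 8 for a smaller partial offload
        "num_ctx": 2048,  # same context size the runner was started with
    },
}

req = urllib.request.Request(
    "http://127.0.0.1:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
    print(reply["message"]["content"])
```

If the request completes with num_gpu set to 0 (or to a value well below the 13 layers offloaded above), that points at the 940M's VRAM budget rather than system memory.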

Nov 14 17:53:16 fedora ollama[1197]: net/http/server.go:3142 +0x8e fp=0xc00029fb90 sp=0xc00029fb60 pc=0x5595606924ce Nov 14 17:53:16 fedora ollama[1197]: net/http.(*conn).serve(0xc00009e000, {0x5595609fca08, 0xc00007cdb0}) Nov 14 17:53:16 fedora ollama[1197]: net/http/server.go:2044 +0x5e8 fp=0xc00029ffb8 sp=0xc00029fb90 pc=0x55956068e268 Nov 14 17:53:16 fedora ollama[1197]: net/http.(*Server).Serve.gowrap3() Nov 14 17:53:16 fedora ollama[1197]: net/http/server.go:3290 +0x28 fp=0xc00029ffe0 sp=0xc00029ffb8 pc=0x559560692c48 Nov 14 17:53:16 fedora ollama[1197]: runtime.goexit({}) Nov 14 17:53:16 fedora ollama[1197]: runtime/asm_amd64.s:1695 +0x1 fp=0xc00029ffe8 sp=0xc00029ffe0 pc=0x5595604a9de1 Nov 14 17:53:16 fedora ollama[1197]: created by net/http.(*Server).Serve in goroutine 1 Nov 14 17:53:16 fedora ollama[1197]: net/http/server.go:3290 +0x4b4 Nov 14 17:53:16 fedora ollama[1197]: rax 0x0 Nov 14 17:53:16 fedora ollama[1197]: rbx 0x9c5 Nov 14 17:53:16 fedora ollama[1197]: rcx 0x7fc2f38a8664 Nov 14 17:53:16 fedora ollama[1197]: rdx 0x6 Nov 14 17:53:16 fedora ollama[1197]: rdi 0x9c2 Nov 14 17:53:16 fedora ollama[1197]: rsi 0x9c5 Nov 14 17:53:16 fedora ollama[1197]: rbp 0x7fc2933f6410 Nov 14 17:53:16 fedora ollama[1197]: rsp 0x7fc2933f63d0 Nov 14 17:53:16 fedora ollama[1197]: r8 0x0 Nov 14 17:53:16 fedora ollama[1197]: r9 0xfffffffc Nov 14 17:53:16 fedora ollama[1197]: r10 0x8 Nov 14 17:53:16 fedora ollama[1197]: r11 0x246 Nov 14 17:53:16 fedora ollama[1197]: r12 0x7fc293400000 Nov 14 17:53:16 fedora ollama[1197]: r13 0x84 Nov 14 17:53:16 fedora ollama[1197]: r14 0x6 Nov 14 17:53:16 fedora ollama[1197]: r15 0x637f60000 Nov 14 17:53:16 fedora ollama[1197]: rip 0x7fc2f38a8664 Nov 14 17:53:16 fedora ollama[1197]: rflags 0x246 Nov 14 17:53:16 fedora ollama[1197]: cs 0x33 Nov 14 17:53:16 fedora ollama[1197]: fs 0x0 Nov 14 17:53:16 fedora ollama[1197]: gs 0x0 Nov 14 17:53:16 fedora ollama[1197]: [GIN] 2024/11/14 - 17:53:16 | 200 | 42.62644068s | 192.168.0.7 | POST "/api/chat" Nov 14 17:53:17 fedora ollama[1197]: [GIN] 2024/11/14 - 17:53:17 | 200 | 752.265µs | 192.168.0.7 | GET "/api/tags" Nov 14 17:54:25 fedora ollama[1197]: [GIN] 2024/11/14 - 17:54:25 | 200 | 819.913µs | 192.168.0.7 | GET "/api/tags" ``` ### OS _No response_ ### GPU _No response_ ### CPU _No response_ ### Ollama version _No response_
GiteaMirror added the bug label 2026-05-04 08:59:55 -05:00

@rick-github commented on GitHub (Nov 14, 2024):

The memory calculations may be a little off, resulting in ollama trying to offload too many layers. You can try reducing the number of layers being offloaded to GPU and see if it loads successfully, see https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650.

As a quick check, what does the following do:

curl localhost:11434/api/generate -d '{"model":"llama3.2:3b","options":{"num_gpu":10},"prompt":"hi","stream":false}'

@rick-github commented on GitHub (Nov 14, 2024):

> Why can't the GPU be used at all? Isn't the CPU using the same memory as the GPU?

No, GPU and CPU have different memory, VRAM vs RAM. In most cases, ollama will spill model weights into RAM if VRAM is not big enough, and use both GPU/CPU + VRAM/RAM for inference. If the requested context will not fit in VRAM, then the whole model will be moved to RAM.
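
A minimal sketch of how to observe that split in practice, assuming `llama3.2:3b` as a placeholder model and treating the exact layer counts as illustrative only:

```
# Ask for a specific number of GPU layers (placeholder value); the remaining layers stay in RAM on the CPU
curl localhost:11434/api/generate -d '{"model":"llama3.2:3b","options":{"num_gpu":10},"prompt":"hi","stream":false}'

# While the model is loaded, the PROCESSOR column shows the split, e.g. "45%/55% CPU/GPU"
ollama ps

# num_gpu 0 keeps everything in RAM on the CPU; the default (-1) lets ollama decide
curl localhost:11434/api/generate -d '{"model":"llama3.2:3b","options":{"num_gpu":0},"prompt":"hi","stream":false}'
```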


@daphil19 commented on GitHub (Nov 15, 2024):

I've been seeing similar issues with several models on my system, with models either working or not working in confusing ways.

I've got a GTX 970 (4GB VRAM) and 40GB RAM. Loading llama3.2:3b yields a similar CUDA error: out of memory despite what I understand to be ample headroom on the GPU to hold the entire model, even if I go all the way down to q2. Other models have the same issue, like appropriately-sized quants of llama3.1:7b and qwen-2.5-coder:7b.

Confusingly, larger models like phi3.5 or mistral:7b do seem to work without issue.

Here are the logs from an attempted run of llama3.2:3b:

[GIN] 2024/11/15 - 14:31:37 | 200 |      32.642µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/11/15 - 14:31:37 | 200 |   57.948877ms |       127.0.0.1 | POST     "/api/show"
time=2024-11-15T14:31:38.011Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-656eccd6-be96-b61b-c5db-c3ea5c2d155a parallel=4 available=4162060288 required="3.7 GiB"
time=2024-11-15T14:31:38.100Z level=INFO source=server.go:105 msg="system memory" total="39.1 GiB" free="14.2 GiB" free_swap="0 B"
time=2024-11-15T14:31:38.100Z level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[3.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
time=2024-11-15T14:31:38.101Z level=INFO source=server.go:383 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 4 --parallel 4 --port 39323"
time=2024-11-15T14:31:38.101Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-11-15T14:31:38.101Z level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
time=2024-11-15T14:31:38.102Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
time=2024-11-15T14:31:38.116Z level=INFO source=runner.go:863 msg="starting go runner"
time=2024-11-15T14:31:38.116Z level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=4
time=2024-11-15T14:31:38.116Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:39323"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /root/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
time=2024-11-15T14:31:38.353Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW) 
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 970, compute capability 5.2, VMM: yes
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   308.23 MiB
llm_load_tensors:      CUDA0 buffer size =  1918.36 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   896.00 MiB
llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   424.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 2
time=2024-11-15T14:31:41.614Z level=INFO source=server.go:601 msg="llama runner started in 3.51 seconds"
[GIN] 2024/11/15 - 14:31:41 | 200 |  3.870364597s |       127.0.0.1 | POST     "/api/generate"
CUDA error: out of memory
  current device: 0, in function alloc at ggml-cuda.cu:406
  cuMemCreate(&handle, reserve_size, &prop, 0)
ggml-cuda.cu:132: CUDA error
/usr/lib/ollama/runners/cuda_v11/ollama_llama_server(+0x3a4d28)[0x5616b1729d28]
/usr/lib/ollama/runners/cuda_v11/ollama_llama_server(ggml_abort+0x136)[0x5616b172b656]
/usr/lib/ollama/libggml_cuda_v11.so(+0x36a52)[0x14de94636a52]
/usr/lib/ollama/libggml_cuda_v11.so(_ZN18ggml_cuda_pool_vmm5allocEmPm+0x1c5)[0x14de94644f35]
/usr/lib/ollama/libggml_cuda_v11.so(+0x398d6)[0x14de946398d6]
/usr/lib/ollama/libggml_cuda_v11.so(+0x3efc9)[0x14de9463efc9]
/usr/lib/ollama/libggml_cuda_v11.so(+0x4408e)[0x14de9464408e]
/usr/lib/ollama/runners/cuda_v11/ollama_llama_server(ggml_backend_sched_graph_compute_async+0x181)[0x5616b1714011]
/usr/lib/ollama/runners/cuda_v11/ollama_llama_server(llama_decode+0x5f1)[0x5616b17f6781]
/usr/lib/ollama/runners/cuda_v11/ollama_llama_server(_cgo_08d1de4ea234_Cfunc_llama_decode+0x51)[0x5616b170bee1]
/usr/lib/ollama/runners/cuda_v11/ollama_llama_server(+0x171a61)[0x5616b14f6a61]
SIGABRT: abort
PC=0x14de786969fc m=7 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 7 gp=0xc00015a000 m=7 mp=0xc000182008 [syscall]:
runtime.cgocall(0x5616b170be90, 0xc000067b60)
        runtime/cgocall.go:157 +0x4b fp=0xc000067b38 sp=0xc000067b00 pc=0x5616b148e3cb
github.com/ollama/ollama/llama._Cfunc_llama_decode(0x14de2405b9e0, {0x85, 0x14de24029b10, 0x0, 0x0, 0x14de243aec50, 0x14de243b0c60, 0x14de240070e0, 0x14de24131550, 0x0, ...})
        _cgo_gotypes.go:543 +0x52 fp=0xc000067b60 sp=0xc000067b38 pc=0x5616b158b952
github.com/ollama/ollama/llama.(*Context).Decode.func1(0x5616b1707d4b?, 0x14de2405b9e0?)
        github.com/ollama/ollama/llama/llama.go:167 +0xd8 fp=0xc000067c80 sp=0xc000067b60 pc=0x5616b158de78
github.com/ollama/ollama/llama.(*Context).Decode(0x5616b1d00060?, 0x0?)
        github.com/ollama/ollama/llama/llama.go:167 +0x17 fp=0xc000067cc8 sp=0xc000067c80 pc=0x5616b158dcd7
main.(*Server).processBatch(0xc00012a120, 0xc0000c8000, 0xc000067f10)
        github.com/ollama/ollama/llama/runner/runner.go:424 +0x29e fp=0xc000067ed0 sp=0xc000067cc8 pc=0x5616b1706d7e
main.(*Server).run(0xc00012a120, {0x5616b1a49a40, 0xc00007e050})
        github.com/ollama/ollama/llama/runner/runner.go:338 +0x1a5 fp=0xc000067fb8 sp=0xc000067ed0 pc=0x5616b1706765
main.main.gowrap2()
        github.com/ollama/ollama/llama/runner/runner.go:901 +0x28 fp=0xc000067fe0 sp=0xc000067fb8 pc=0x5616b170aec8
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000067fe8 sp=0xc000067fe0 pc=0x5616b14f6de1
created by main.main in goroutine 1
        github.com/ollama/ollama/llama/runner/runner.go:901 +0xc2b

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0xc000032008?, 0x0?, 0xc0?, 0x61?, 0xc00002b8c0?)
        runtime/proc.go:402 +0xce fp=0xc00002b888 sp=0xc00002b868 pc=0x5616b14c500e
runtime.netpollblock(0xc00002b920?, 0xb148db26?, 0x16?)
        runtime/netpoll.go:573 +0xf7 fp=0xc00002b8c0 sp=0xc00002b888 pc=0x5616b14bd257
internal/poll.runtime_pollWait(0x14de91067fe0, 0x72)
        runtime/netpoll.go:345 +0x85 fp=0xc00002b8e0 sp=0xc00002b8c0 pc=0x5616b14f1aa5
internal/poll.(*pollDesc).wait(0x3?, 0x3fe?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00002b908 sp=0xc00002b8e0 pc=0x5616b15419c7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000154080)
        internal/poll/fd_unix.go:611 +0x2ac fp=0xc00002b9b0 sp=0xc00002b908 pc=0x5616b1542e8c
net.(*netFD).accept(0xc000154080)
        net/fd_unix.go:172 +0x29 fp=0xc00002ba68 sp=0xc00002b9b0 pc=0x5616b15b18a9
net.(*TCPListener).accept(0xc0000721e0)
        net/tcpsock_posix.go:159 +0x1e fp=0xc00002ba90 sp=0xc00002ba68 pc=0x5616b15c25de
net.(*TCPListener).Accept(0xc0000721e0)
        net/tcpsock.go:327 +0x30 fp=0xc00002bac0 sp=0xc00002ba90 pc=0x5616b15c1930
net/http.(*onceCloseListener).Accept(0xc00012a1b0?)
        <autogenerated>:1 +0x24 fp=0xc00002bad8 sp=0xc00002bac0 pc=0x5616b16e8a44
net/http.(*Server).Serve(0xc0000161e0, {0x5616b1a49400, 0xc0000721e0})
        net/http/server.go:3260 +0x33e fp=0xc00002bc08 sp=0xc00002bad8 pc=0x5616b16df85e
main.main()
        github.com/ollama/ollama/llama/runner/runner.go:921 +0xfcc fp=0xc00002bf50 sp=0xc00002bc08 pc=0x5616b170ac4c
runtime.main()
        runtime/proc.go:271 +0x29d fp=0xc00002bfe0 sp=0xc00002bf50 pc=0x5616b14c4bdd
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc00002bfe8 sp=0xc00002bfe0 pc=0x5616b14f6de1

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:402 +0xce fp=0xc000052fa8 sp=0xc000052f88 pc=0x5616b14c500e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.forcegchelper()
        runtime/proc.go:326 +0xb8 fp=0xc000052fe0 sp=0xc000052fa8 pc=0x5616b14c4e98
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000052fe8 sp=0xc000052fe0 pc=0x5616b14f6de1
created by runtime.init.6 in goroutine 1
        runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:402 +0xce fp=0xc000053780 sp=0xc000053760 pc=0x5616b14c500e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.bgsweep(0xc000020070)
        runtime/mgcsweep.go:278 +0x94 fp=0xc0000537c8 sp=0xc000053780 pc=0x5616b14afb54
runtime.gcenable.gowrap1()
        runtime/mgc.go:203 +0x25 fp=0xc0000537e0 sp=0xc0000537c8 pc=0x5616b14a4685
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000537e8 sp=0xc0000537e0 pc=0x5616b14f6de1
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
runtime.gopark(0xc000020070?, 0x5616b1949e98?, 0x1?, 0x0?, 0xc000007340?)
        runtime/proc.go:402 +0xce fp=0xc000053f78 sp=0xc000053f58 pc=0x5616b14c500e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.(*scavengerState).park(0x5616b1c174c0)
        runtime/mgcscavenge.go:425 +0x49 fp=0xc000053fa8 sp=0xc000053f78 pc=0x5616b14ad549
runtime.bgscavenge(0xc000020070)
        runtime/mgcscavenge.go:653 +0x3c fp=0xc000053fc8 sp=0xc000053fa8 pc=0x5616b14adadc
runtime.gcenable.gowrap2()
        runtime/mgc.go:204 +0x25 fp=0xc000053fe0 sp=0xc000053fc8 pc=0x5616b14a4625
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000053fe8 sp=0xc000053fe0 pc=0x5616b14f6de1
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:204 +0xa5

goroutine 5 gp=0xc000007c00 m=nil [finalizer wait]:
runtime.gopark(0xc000052648?, 0x5616b1497f85?, 0xa8?, 0x1?, 0xc0000061c0?)
        runtime/proc.go:402 +0xce fp=0xc000052620 sp=0xc000052600 pc=0x5616b14c500e
runtime.runfinq()
        runtime/mfinal.go:194 +0x107 fp=0xc0000527e0 sp=0xc000052620 pc=0x5616b14a36c7
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000527e8 sp=0xc0000527e0 pc=0x5616b14f6de1
created by runtime.createfing in goroutine 1
        runtime/mfinal.go:164 +0x3d

goroutine 8 gp=0xc00015a1c0 m=nil [select]:
runtime.gopark(0xc000207a80?, 0x2?, 0x60?, 0x0?, 0xc000207824?)
        runtime/proc.go:402 +0xce fp=0xc000207698 sp=0xc000207678 pc=0x5616b14c500e
runtime.selectgo(0xc000207a80, 0xc000207820, 0x85?, 0x0, 0x1?, 0x1)
        runtime/select.go:327 +0x725 fp=0xc0002077b8 sp=0xc000207698 pc=0x5616b14d63e5
main.(*Server).completion(0xc00012a120, {0x5616b1a495b0, 0xc0000189a0}, 0xc00012c6c0)
        github.com/ollama/ollama/llama/runner/runner.go:652 +0x8fe fp=0xc000207ab8 sp=0xc0002077b8 pc=0x5616b17086de
main.(*Server).completion-fm({0x5616b1a495b0?, 0xc0000189a0?}, 0x5616b16e3b8d?)
        <autogenerated>:1 +0x36 fp=0xc000207ae8 sp=0xc000207ab8 pc=0x5616b170b6b6
net/http.HandlerFunc.ServeHTTP(0xc00010ec30?, {0x5616b1a495b0?, 0xc0000189a0?}, 0x10?)
        net/http/server.go:2171 +0x29 fp=0xc000207b10 sp=0xc000207ae8 pc=0x5616b16dc629
net/http.(*ServeMux).ServeHTTP(0x5616b1497f85?, {0x5616b1a495b0, 0xc0000189a0}, 0xc00012c6c0)
        net/http/server.go:2688 +0x1ad fp=0xc000207b60 sp=0xc000207b10 pc=0x5616b16de4ad
net/http.serverHandler.ServeHTTP({0x5616b1a48900?}, {0x5616b1a495b0?, 0xc0000189a0?}, 0x6?)
        net/http/server.go:3142 +0x8e fp=0xc000207b90 sp=0xc000207b60 pc=0x5616b16df4ce
net/http.(*conn).serve(0xc00012a1b0, {0x5616b1a49a08, 0xc00010cdb0})
        net/http/server.go:2044 +0x5e8 fp=0xc000207fb8 sp=0xc000207b90 pc=0x5616b16db268
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3290 +0x28 fp=0xc000207fe0 sp=0xc000207fb8 pc=0x5616b16dfc48
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000207fe8 sp=0xc000207fe0 pc=0x5616b14f6de1
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3290 +0x4b4

goroutine 12 gp=0xc00015a380 m=nil [IO wait]:
runtime.gopark(0x10?, 0x10?, 0xf0?, 0x55?, 0xb?)
        runtime/proc.go:402 +0xce fp=0xc0000555a8 sp=0xc000055588 pc=0x5616b14c500e
runtime.netpollblock(0x5616b152b558?, 0xb148db26?, 0x16?)
        runtime/netpoll.go:573 +0xf7 fp=0xc0000555e0 sp=0xc0000555a8 pc=0x5616b14bd257
internal/poll.runtime_pollWait(0x14de91067ee8, 0x72)
        runtime/netpoll.go:345 +0x85 fp=0xc000055600 sp=0xc0000555e0 pc=0x5616b14f1aa5
internal/poll.(*pollDesc).wait(0xc000154100?, 0xc000098041?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000055628 sp=0xc000055600 pc=0x5616b15419c7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000154100, {0xc000098041, 0x1, 0x1})
        internal/poll/fd_unix.go:164 +0x27a fp=0xc0000556c0 sp=0xc000055628 pc=0x5616b154251a
net.(*netFD).Read(0xc000154100, {0xc000098041?, 0xc000055748?, 0x5616b14f36d0?})
        net/fd_posix.go:55 +0x25 fp=0xc000055708 sp=0xc0000556c0 pc=0x5616b15b07a5
net.(*conn).Read(0xc000056098, {0xc000098041?, 0x0?, 0x5616b1d00060?})
        net/net.go:185 +0x45 fp=0xc000055750 sp=0xc000055708 pc=0x5616b15baa65
net.(*TCPConn).Read(0x5616b1bd8840?, {0xc000098041?, 0x0?, 0x0?})
        <autogenerated>:1 +0x25 fp=0xc000055780 sp=0xc000055750 pc=0x5616b15c6445
net/http.(*connReader).backgroundRead(0xc000098030)
        net/http/server.go:681 +0x37 fp=0xc0000557c8 sp=0xc000055780 pc=0x5616b16d51d7
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:677 +0x25 fp=0xc0000557e0 sp=0xc0000557c8 pc=0x5616b16d5105
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000557e8 sp=0xc0000557e0 pc=0x5616b14f6de1
created by net/http.(*connReader).startBackgroundRead in goroutine 8
        net/http/server.go:677 +0xba

rax    0x0
rbx    0x14de30bbd000
rcx    0x14de786969fc
rdx    0x6
rdi    0x2b
rsi    0x31
rbp    0x31
rsp    0x14de30bb3390
r8     0x14de30bb3460
r9     0x5
r10    0x8
r11    0x246
r12    0x6
r13    0x16
r14    0x14de922edc2f
r15    0xdb71a0000
rip    0x14de786969fc
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
[GIN] 2024/11/15 - 14:32:35 | 200 |  2.714013605s |       127.0.0.1 | POST     "/api/chat"

@kripper commented on GitHub (Nov 15, 2024):

> As a quick check, what does the following do:

Short prompts are no problem:

[root@fedora ~]# curl localhost:11434/api/generate -d '{"model":"llama3.2:latest","options":{"num_gpu":10},"prompt":"hi","stream":false}'

{"model":"llama3.2:latest","created_at":"2024-11-15T16:34:59.506124077Z","response":"How can I assist you today?","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,6151,128009,128006,78191,128007,271,4438,649,358,7945,499,3432,30],"total_duration":63607924193,"load_duration":52833178648,"prompt_eval_count":26,"prompt_eval_duration":9447000000,"eval_count":8,"eval_duration":1326000000}

[root@fedora ~]# curl localhost:11434/api/generate -d '{"model":"llama3.2-vision:latest","options":{"num_gpu":10},"prompt":"hi","stream":false}'

{"model":"llama3.2-vision:latest","created_at":"2024-11-15T16:44:44.727808022Z","response":"How's your day going so far? Is there something I can help you with or would you like to chat?","done":true,"done_reason":"stop","context":[128006,882,128007,271,6151,128009,128006,78191,128007,271,4438,596,701,1938,2133,779,3117,30,2209,1070,2555,358,649,1520,499,449,477,1053,499,1093,311,6369,30],"total_duration":8708490324,"load_duration":27991851,"prompt_eval_count":11,"prompt_eval_duration":359000000,"eval_count":24,"eval_duration":8320000000}


@kripper commented on GitHub (Nov 15, 2024):

> No, GPU and CPU have different memory, VRAM vs RAM. In most cases, ollama will spill model weights into RAM if VRAM is not big enough, and use both GPU/CPU + VRAM/RAM for inference. If the requested context will not fit in VRAM, then the whole model will be moved to RAM.

I see. So, Ollama cannot access the RAM via an integrated GPU?
Is it possible to support this feature in the future?
I believe it's called UMA (Unified Memory Architecture), correct?

Would UMA support allow handling a larger requested context by utilizing both VRAM and RAM?


@rick-github commented on GitHub (Nov 15, 2024):

I glossed over some details there for brevity. For an integrated GPU, yes, the memory is the same, but it's partitioned. I don't deal with those sorts of systems, so I don't know for sure, but my understanding is that the partitioning is determined in the BIOS and the memory is not interchangeable. My understanding could be wrong.

For discrete Nvidia devices, some models do support accessing system RAM from the GPU. Nvidia calls this fallback memory, llama.cpp calls this unified memory. llama.cpp uses this on supported drivers, on by default for Windows, while Linux users need to set an environment variable. More discussion [here](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900), the summary of which is that in the use cases I've seen, using unified memory is not a win for large models on small machines.
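
For reference, the llama.cpp environment variable being referred to is presumably `GGML_CUDA_ENABLE_UNIFIED_MEMORY`; whether a given ollama build passes it through to the CUDA backend is an assumption to verify, so treat this as a rough sketch for a systemd-managed Linux install rather than a confirmed recipe:

```
# Assumed variable name; verify against your llama.cpp/ollama version before relying on it
sudo systemctl edit ollama.service
# add in the override file:
#   [Service]
#   Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
sudo systemctl restart ollama
```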

Back to the problem you are having - you successfully loaded the model by reducing the number of layers offloaded to the GPU. This points to ollama not properly computing the memory requirements for the model in a really constrained space. Memory calculations depend on model architecture, so ollama may err for some models and not others. Anecdotally, I've also seen old GPU/driver combos be a bit loose with memory reporting, so if ollama is not getting the right numbers to work with, that may contribute to the error. The same holds true for @daphil19's 970. My suggestion is to adjust num_gpu until you get a failure to load, then back off one or two layers and create a new model as detailed in https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650.
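
A minimal sketch of that workflow (the layer count and the derived model name below are placeholders, not values measured in this thread):

```
# Probe: lower the number of offloaded layers per request until the model loads reliably
curl localhost:11434/api/generate -d '{"model":"llama3.2:latest","options":{"num_gpu":11},"prompt":"hi","stream":false}'

# Bake the working value into a new model so it applies to every request
cat > Modelfile <<'EOF'
FROM llama3.2:latest
PARAMETER num_gpu 11
EOF
ollama create llama3.2-fit -f Modelfile
```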


@kripper commented on GitHub (Nov 15, 2024):

Ok, thanks.
BTW, I'm pasting the hardware specs for my two iGPUs.
I believe UMA is only supported on the HD Graphics 520, but not on the GeForce 940M.
And Ollama is probably using the GeForce 940M because it's the only one supporting CUDA.
Does Ollama support OpenVINO?

lshw -C display
  *-display
       description: VGA compatible controller
       product: Skylake GT2 [HD Graphics 520]
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 07
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
       configuration: driver=i915 latency=0
       resources: irq:128 memory:dd000000-ddffffff memory:b0000000-bfffffff ioport:f000(size=64) memory:c0000-dffff
  *-display
       description: 3D controller
       product: GM108M [GeForce 940M]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a2
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:134 memory:de000000-deffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:e000(size=128) memory:df000000-df07ffff

@rick-github commented on GitHub (Nov 15, 2024):

Not currently. There are open tickets for [OpenVINO](https://github.com/ollama/ollama/issues/2169) and other backends like [ONNX](https://github.com/ollama/ollama/issues/6502), but implementing support for those has a lower priority than other ongoing work.


@kripper commented on GitHub (Nov 15, 2024):

Yes, and there is also the [pending Vulkan PR](https://github.com/ollama/ollama/pull/5059) that could be useful for Intel iGPUs.


@rick-github commented on GitHub (Nov 15, 2024):

There's another way the OOM problem can be mitigated. The problem with setting a fixed layer count is that it's inflexible in the face of a changing context size or other apps and models using VRAM. As an alternative, [`OLLAMA_GPU_OVERHEAD`](https://github.com/ollama/ollama/blob/d875e99e4639dc07af90b2e3ea0d175e2e692efb/envconfig/config.go#L237) can be set to enforce a buffer between the VRAM ollama wants to allocate to layers and the free space on the GPU. That way, if ollama's memory calculations are off, the overflow goes into the buffer rather than llama.cpp OOMing.
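
For example (the buffer size is illustrative, not a recommendation; the value is in bytes):

```
# Hold back roughly 500 MiB of VRAM as a safety buffer, then start the server
export OLLAMA_GPU_OVERHEAD=524288000
ollama serve
```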


@kripper commented on GitHub (Nov 16, 2024):

Ok. I used OLLAMA_GPU_OVERHEAD=7000000000 to bypass the memory validation.
I'm testing qwen2.5:7b on similar hardware with 2 iGPUs.
Ollama is now using all available VRAM (2 GB) + 3.8 GB shared memory.
ollama ps reports 100% GPU and Windows Task Manager reports 100% CUDA on the NVIDIA GPU...

Everything looks good, except:

ollama is also using 20% CPU. Why? It is supposed to only use GPU.
And the performance is notably worse than when using a 75%/25% CPU/GPU split, in which case the GPU only uses 7% CUDA.

Is there any way to profile or trace what's going on?
Maybe llama.cpp is not directly accessing the shared memory, but copying from the buffer to the GPU or doing some other unnecessary operations?

[According to ChatGPT](https://chatgpt.com/share/6737f909-eb70-8002-8a00-7cc341df98b5), the GPU should be able to directly access the shared memory, and this should be faster than doing CPU compute.


@rick-github commented on GitHub (Nov 16, 2024):

Logs will help. If you've set OLLAMA_GPU_OVERHEAD=7G and you're using similar hardware with iGPUs with 4G of VRAM, then you may be setting up a sub-optimal configuration. If the model is being forced into shared RAM there may be a performance penalty as previously pointed out.
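
If ollama is running as the systemd service (as in the earlier Fedora logs), something along these lines would capture them; OLLAMA_DEBUG just adds more detail:

```
# Follow the server logs while reproducing the failure
journalctl -u ollama -f

# Or run the server in the foreground with debug logging for a one-off test
OLLAMA_DEBUG=1 ollama serve
```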


@kripper commented on GitHub (Nov 16, 2024):

Yes, all GPU VRAM is used and the rest is using shared GPU memory.

What I wonder is that, in theory, using GPU + shared GPU memory should be faster than using CPU + RAM (even when shared GPU memory is slower than RAM), because CPUs generally lack the massively parallel processing capability of GPUs, making them slower for large-scale matrix operations unless the operations are small or require minimal parallelism.


@rick-github commented on GitHub (Nov 16, 2024):

GPUs have massively parallel processing capability but it's useless if you can't feed it data.


@kripper commented on GitHub (Nov 16, 2024):

Right. In the end, the overall performance will depend on the caching strategy used to reduce memory bandwidth usage.


@rick-github commented on GitHub (Nov 16, 2024):

Well, that's the thing - LLMs are very poor candidates for caching. So as soon as a modest part of your model resides out of VRAM, performance suffers.
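
As a rough back-of-envelope (the link speed is an assumption; the ~1.87 GiB model size comes from the logs below): if most of the quantized weights have to cross the PCIe bus for every generated token at, say, ~8 GB/s effective bandwidth, throughput is capped at roughly 8 / 1.87 ≈ 4 tokens/s, no matter how fast the GPU's compute units are.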


@kripper commented on GitHub (Nov 17, 2024):

Llama 3.2 3B works fine for small prompts.
For bigger prompts, it throws OOM.

Some tests:

  • With OLLAMA_GPU_OVERHEAD < 1 GB, it OOM crashes.
  • With OLLAMA_GPU_OVERHEAD > 0.9 G, it uses 100% CPU (very slow).
  • With OLLAMA_GPU_OVERHEAD=2000000000 (2 G), it works very slowly with 100% CPU.

Expected behaviour: use "GPU + Shared Memory" or "GPU + VRAM + CPU + RAM".

Here are the logs from an OOM crash.

-sh-5.2$ export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
-sh-5.2$ export OLLAMA_GPU_OVERHEAD=0
-sh-5.2$ ollama serve
2024/11/17 17:20:27 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-11-17T17:20:27.398-03:00 level=INFO source=images.go:755 msg="total blobs: 17"
time=2024-11-17T17:20:27.399-03:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
time=2024-11-17T17:20:27.399-03:00 level=INFO source=routes.go:1240 msg="Listening on [::]:11434 (version 0.4.1)"
time=2024-11-17T17:20:27.400-03:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama3646209345/runners
time=2024-11-17T17:20:27.603-03:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm]"
time=2024-11-17T17:20:27.603-03:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-17T17:20:27.684-03:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-347193f9-2627-a9eb-8c2e-e2158c820e98 library=cuda variant=v11 compute=5.0 driver=12.5 name="NVIDIA GeForce 940M" total="1.9 GiB" available="1.9 GiB"
time=2024-11-17T17:20:42.657-03:00 level=INFO source=server.go:105 msg="system memory" total="11.1 GiB" free="10.0 GiB" free_swap="8.0 GiB"
time=2024-11-17T17:20:42.658-03:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=13 layers.split="" memory.available="[1.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.2 GiB" memory.required.partial="1.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.9 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
time=2024-11-17T17:20:42.660-03:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3646209345/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --n-gpu-layers 13 --threads 2 --parallel 1 --port 44877"
time=2024-11-17T17:20:42.661-03:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-11-17T17:20:42.661-03:00 level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
time=2024-11-17T17:20:42.661-03:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
time=2024-11-17T17:20:42.672-03:00 level=INFO source=runner.go:863 msg="starting go runner"
time=2024-11-17T17:20:42.672-03:00 level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=2
time=2024-11-17T17:20:42.673-03:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:44877"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2024-11-17T17:20:42.913-03:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce 940M, compute capability 5.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors: offloading 13 repeating layers to GPU
llm_load_tensors: offloaded 13/29 layers to GPU
llm_load_tensors:        CPU buffer size =  1918.35 MiB
llm_load_tensors:      CUDA0 buffer size =   757.22 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   120.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   104.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   564.73 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 199
time=2024-11-17T17:20:44.668-03:00 level=INFO source=server.go:601 msg="llama runner started in 2.01 seconds"
CUDA error: out of memory
  current device: 0, in function alloc at ggml-cuda.cu:406
  cuMemCreate(&handle, reserve_size, &prop, 0)
ggml-cuda.cu:132: CUDA error
[New LWP 14372]
[New LWP 14371]
[New LWP 14370]
[New LWP 14369]
[New LWP 14368]
[New LWP 14367]
[New LWP 14366]
[New LWP 14365]
[New LWP 14364]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x0000560841fd6ba3 in ?? ()
#0  0x0000560841fd6ba3 in ?? ()
#1  0x0000560841f9bef0 in _start ()
[Inferior 1 (process 14363) detached]
SIGABRT: abort
PC=0x7fbedcdc4664 m=7 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 20 gp=0xc000082c40 m=7 mp=0xc000182008 [syscall]:
runtime.cgocall(0x5608421e9e90, 0xc000054b60)
        runtime/cgocall.go:157 +0x4b fp=0xc000054b38 sp=0xc000054b00 pc=0x560841f6c3cb
github.com/ollama/ollama/llama._Cfunc_llama_decode(0x7fbe68006910, {0x200, 0x7fbe681c5220, 0x0, 0x0, 0x7fbe68027c10, 0x7fbe68028420, 0x7fbe68028c30, 0x7fbe2d9d84a0, 0x0, ...})
        _cgo_gotypes.go:543 +0x52 fp=0xc000054b60 sp=0xc000054b38 pc=0x560842069952
github.com/ollama/ollama/llama.(*Context).Decode.func1(0x5608421e5d4b?, 0x7fbe68006910?)
        github.com/ollama/ollama/llama/llama.go:167 +0xd8 fp=0xc000054c80 sp=0xc000054b60 pc=0x56084206be78
github.com/ollama/ollama/llama.(*Context).Decode(0x5608427de060?, 0x0?)
        github.com/ollama/ollama/llama/llama.go:167 +0x17 fp=0xc000054cc8 sp=0xc000054c80 pc=0x56084206bcd7
main.(*Server).processBatch(0xc0000c4120, 0xc0000c2150, 0xc000040f10)
        github.com/ollama/ollama/llama/runner/runner.go:424 +0x29e fp=0xc000054ed0 sp=0xc000054cc8 pc=0x5608421e4d7e
main.(*Server).run(0xc0000c4120, {0x560842527a40, 0xc000104000})
        github.com/ollama/ollama/llama/runner/runner.go:338 +0x1a5 fp=0xc000054fb8 sp=0xc000054ed0 pc=0x5608421e4765
main.main.gowrap2()
        github.com/ollama/ollama/llama/runner/runner.go:901 +0x28 fp=0xc000054fe0 sp=0xc000054fb8 pc=0x5608421e8ec8
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000054fe8 sp=0xc000054fe0 pc=0x560841fd4de1
created by main.main in goroutine 1
        github.com/ollama/ollama/llama/runner/runner.go:901 +0xc2b

goroutine 1 gp=0xc000006380 m=nil [IO wait]:
runtime.gopark(0xc000034508?, 0x0?, 0x80?, 0x63?, 0xc00002b8c0?)
        runtime/proc.go:402 +0xce fp=0xc00002b888 sp=0xc00002b868 pc=0x560841fa300e
runtime.netpollblock(0xc00002b920?, 0x41f6bb26?, 0x8?)
        runtime/netpoll.go:573 +0xf7 fp=0xc00002b8c0 sp=0xc00002b888 pc=0x560841f9b257
internal/poll.runtime_pollWait(0x7fbedeb88f20, 0x72)
        runtime/netpoll.go:345 +0x85 fp=0xc00002b8e0 sp=0xc00002b8c0 pc=0x560841fcfaa5
internal/poll.(*pollDesc).wait(0x3?, 0x3fe?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00002b908 sp=0xc00002b8e0 pc=0x56084201f9c7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000fe080)
        internal/poll/fd_unix.go:611 +0x2ac fp=0xc00002b9b0 sp=0xc00002b908 pc=0x560842020e8c
net.(*netFD).accept(0xc0000fe080)
        net/fd_unix.go:172 +0x29 fp=0xc00002ba68 sp=0xc00002b9b0 pc=0x56084208f8a9
net.(*TCPListener).accept(0xc0000c61a0)
        net/tcpsock_posix.go:159 +0x1e fp=0xc00002ba90 sp=0xc00002ba68 pc=0x5608420a05de
net.(*TCPListener).Accept(0xc0000c61a0)
        net/tcpsock.go:327 +0x30 fp=0xc00002bac0 sp=0xc00002ba90 pc=0x56084209f930
net/http.(*onceCloseListener).Accept(0xc0000c41b0?)
        <autogenerated>:1 +0x24 fp=0xc00002bad8 sp=0xc00002bac0 pc=0x5608421c6a44
net/http.(*Server).Serve(0xc0000aa0f0, {0x560842527400, 0xc0000c61a0})
        net/http/server.go:3260 +0x33e fp=0xc00002bc08 sp=0xc00002bad8 pc=0x5608421bd85e
main.main()
        github.com/ollama/ollama/llama/runner/runner.go:921 +0xfcc fp=0xc00002bf50 sp=0xc00002bc08 pc=0x5608421e8c4c
runtime.main()
        runtime/proc.go:271 +0x29d fp=0xc00002bfe0 sp=0xc00002bf50 pc=0x560841fa2bdd
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc00002bfe8 sp=0xc00002bfe0 pc=0x560841fd4de1

goroutine 2 gp=0xc000006e00 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:402 +0xce fp=0xc000044fa8 sp=0xc000044f88 pc=0x560841fa300e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.forcegchelper()
        runtime/proc.go:326 +0xb8 fp=0xc000044fe0 sp=0xc000044fa8 pc=0x560841fa2e98
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000044fe8 sp=0xc000044fe0 pc=0x560841fd4de1
created by runtime.init.6 in goroutine 1
        runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007340 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:402 +0xce fp=0xc000045780 sp=0xc000045760 pc=0x560841fa300e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.bgsweep(0xc00006c000)
        runtime/mgcsweep.go:278 +0x94 fp=0xc0000457c8 sp=0xc000045780 pc=0x560841f8db54
runtime.gcenable.gowrap1()
        runtime/mgc.go:203 +0x25 fp=0xc0000457e0 sp=0xc0000457c8 pc=0x560841f82685
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000457e8 sp=0xc0000457e0 pc=0x560841fd4de1
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007500 m=nil [GC scavenge wait]:
runtime.gopark(0xc00006c000?, 0x560842427e98?, 0x1?, 0x0?, 0xc000007500?)
        runtime/proc.go:402 +0xce fp=0xc000045f78 sp=0xc000045f58 pc=0x560841fa300e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.(*scavengerState).park(0x5608426f54c0)
        runtime/mgcscavenge.go:425 +0x49 fp=0xc000045fa8 sp=0xc000045f78 pc=0x560841f8b549
runtime.bgscavenge(0xc00006c000)
        runtime/mgcscavenge.go:653 +0x3c fp=0xc000045fc8 sp=0xc000045fa8 pc=0x560841f8badc
runtime.gcenable.gowrap2()
        runtime/mgc.go:204 +0x25 fp=0xc000045fe0 sp=0xc000045fc8 pc=0x560841f82625
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000045fe8 sp=0xc000045fe0 pc=0x560841fd4de1
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:204 +0xa5

goroutine 18 gp=0xc0000828c0 m=nil [finalizer wait]:
runtime.gopark(0xc000044648?, 0x560841f75f85?, 0xa8?, 0x1?, 0xc000006380?)
        runtime/proc.go:402 +0xce fp=0xc000044620 sp=0xc000044600 pc=0x560841fa300e
runtime.runfinq()
        runtime/mfinal.go:194 +0x107 fp=0xc0000447e0 sp=0xc000044620 pc=0x560841f816c7
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000447e8 sp=0xc0000447e0 pc=0x560841fd4de1
created by runtime.createfing in goroutine 1
        runtime/mfinal.go:164 +0x3d

goroutine 21 gp=0xc000082e00 m=nil [select]:
runtime.gopark(0xc00013fa80?, 0x2?, 0x60?, 0x0?, 0xc00013f824?)
        runtime/proc.go:402 +0xce fp=0xc00013f698 sp=0xc00013f678 pc=0x560841fa300e
runtime.selectgo(0xc00013fa80, 0xc00013f820, 0x7df?, 0x0, 0x1?, 0x1)
        runtime/select.go:327 +0x725 fp=0xc00013f7b8 sp=0xc00013f698 pc=0x560841fb43e5
main.(*Server).completion(0xc0000c4120, {0x5608425275b0, 0xc0000fc7e0}, 0xc0000ca900)
        github.com/ollama/ollama/llama/runner/runner.go:652 +0x8fe fp=0xc00013fab8 sp=0xc00013f7b8 pc=0x5608421e66de
main.(*Server).completion-fm({0x5608425275b0?, 0xc0000fc7e0?}, 0x5608421c1b8d?)
        <autogenerated>:1 +0x36 fp=0xc00013fae8 sp=0xc00013fab8 pc=0x5608421e96b6
net/http.HandlerFunc.ServeHTTP(0xc00009eea0?, {0x5608425275b0?, 0xc0000fc7e0?}, 0x10?)
        net/http/server.go:2171 +0x29 fp=0xc00013fb10 sp=0xc00013fae8 pc=0x5608421ba629
net/http.(*ServeMux).ServeHTTP(0x560841f75f85?, {0x5608425275b0, 0xc0000fc7e0}, 0xc0000ca900)
        net/http/server.go:2688 +0x1ad fp=0xc00013fb60 sp=0xc00013fb10 pc=0x5608421bc4ad
net/http.serverHandler.ServeHTTP({0x560842526900?}, {0x5608425275b0?, 0xc0000fc7e0?}, 0x6?)
        net/http/server.go:3142 +0x8e fp=0xc00013fb90 sp=0xc00013fb60 pc=0x5608421bd4ce
net/http.(*conn).serve(0xc0000c41b0, {0x560842527a08, 0xc00009cdb0})
        net/http/server.go:2044 +0x5e8 fp=0xc00013ffb8 sp=0xc00013fb90 pc=0x5608421b9268
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3290 +0x28 fp=0xc00013ffe0 sp=0xc00013ffb8 pc=0x5608421bdc48
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc00013ffe8 sp=0xc00013ffe0 pc=0x560841fd4de1
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3290 +0x4b4

goroutine 25 gp=0xc000083180 m=nil [IO wait]:
runtime.gopark(0x10?, 0x10?, 0xf0?, 0x15?, 0xb?)
        runtime/proc.go:402 +0xce fp=0xc0000415a8 sp=0xc000041588 pc=0x560841fa300e
runtime.netpollblock(0x560842009558?, 0x41f6bb26?, 0x8?)
        runtime/netpoll.go:573 +0xf7 fp=0xc0000415e0 sp=0xc0000415a8 pc=0x560841f9b257
internal/poll.runtime_pollWait(0x7fbedeb88e28, 0x72)
        runtime/netpoll.go:345 +0x85 fp=0xc000041600 sp=0xc0000415e0 pc=0x560841fcfaa5
internal/poll.(*pollDesc).wait(0xc0000fe100?, 0xc00009cee1?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000041628 sp=0xc000041600 pc=0x56084201f9c7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0000fe100, {0xc00009cee1, 0x1, 0x1})
        internal/poll/fd_unix.go:164 +0x27a fp=0xc0000416c0 sp=0xc000041628 pc=0x56084202051a
net.(*netFD).Read(0xc0000fe100, {0xc00009cee1?, 0xc000041748?, 0x560841fd16d0?})
        net/fd_posix.go:55 +0x25 fp=0xc000041708 sp=0xc0000416c0 pc=0x56084208e7a5
net.(*conn).Read(0xc00009a098, {0xc00009cee1?, 0x0?, 0x5608427de060?})
        net/net.go:185 +0x45 fp=0xc000041750 sp=0xc000041708 pc=0x560842098a65
net.(*TCPConn).Read(0x5608426b6840?, {0xc00009cee1?, 0x0?, 0x0?})
        <autogenerated>:1 +0x25 fp=0xc000041780 sp=0xc000041750 pc=0x5608420a4445
net/http.(*connReader).backgroundRead(0xc00009ced0)
        net/http/server.go:681 +0x37 fp=0xc0000417c8 sp=0xc000041780 pc=0x5608421b31d7
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:677 +0x25 fp=0xc0000417e0 sp=0xc0000417c8 pc=0x5608421b3105
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000417e8 sp=0xc0000417e0 pc=0x560841fd4de1
created by net/http.(*connReader).startBackgroundRead in goroutine 21
        net/http/server.go:677 +0xba

rax    0x0
rbx    0x3821
rcx    0x7fbedcdc4664
rdx    0x6
rdi    0x381b
rsi    0x3821
rbp    0x7fbe763f6410
rsp    0x7fbe763f63d0
r8     0x0
r9     0xfffffffb
r10    0x8
r11    0x246
r12    0x7fbe76400000
r13    0x84
r14    0x6
r15    0x637f60000
rip    0x7fbedcdc4664
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

@rick-github commented on GitHub (Nov 17, 2024):

Do you have logs from when OLLAMA_GPU_OVERHEAD is not zero?


@kripper commented on GitHub (Nov 17, 2024):

> Do you have logs from when OLLAMA_GPU_OVERHEAD is not zero?

Here are the logs with OLLAMA_GPU_OVERHEAD=1.2 G (it works very slowly, using 100% CPU):

-sh-5.2$ export OLLAMA_GPU_OVERHEAD=1200000000
-sh-5.2$ ollama serve
2024/11/17 17:53:44 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:1200000000 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-11-17T17:53:44.066-03:00 level=INFO source=images.go:755 msg="total blobs: 17"
time=2024-11-17T17:53:44.067-03:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
time=2024-11-17T17:53:44.067-03:00 level=INFO source=routes.go:1240 msg="Listening on [::]:11434 (version 0.4.1)"
time=2024-11-17T17:53:44.067-03:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama4035721721/runners
time=2024-11-17T17:53:44.271-03:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm cpu]"
time=2024-11-17T17:53:44.271-03:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-17T17:53:44.352-03:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-347193f9-2627-a9eb-8c2e-e2158c820e98 library=cuda variant=v11 compute=5.0 driver=12.5 name="NVIDIA GeForce 940M" total="1.9 GiB" available="1.9 GiB"
time=2024-11-17T17:53:51.300-03:00 level=INFO source=server.go:105 msg="system memory" total="11.1 GiB" free="10.0 GiB" free_swap="8.0 GiB"
time=2024-11-17T17:53:51.301-03:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[1.9 GiB]" memory.gpu_overhead="1.1 GiB" memory.required.full="2.2 GiB" memory.required.partial="0 B" memory.required.kv="224.0 MiB" memory.required.allocations="[0 B]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
time=2024-11-17T17:53:51.303-03:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama4035721721/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --threads 2 --no-mmap --parallel 1 --port 43929"
time=2024-11-17T17:53:51.304-03:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-11-17T17:53:51.304-03:00 level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
time=2024-11-17T17:53:51.304-03:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
time=2024-11-17T17:53:51.309-03:00 level=INFO source=runner.go:863 msg="starting go runner"
time=2024-11-17T17:53:51.309-03:00 level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=2
time=2024-11-17T17:53:51.309-03:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:43929"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2024-11-17T17:53:51.555-03:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors:        CPU buffer size =  2226.59 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
llama_new_context_with_model:        CPU compute buffer size =   256.50 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 1
time=2024-11-17T17:53:53.561-03:00 level=INFO source=server.go:601 msg="llama runner started in 2.26 seconds"
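
A note on the scheduler's decision in this run: the "offload to cuda" line reports memory.available="1.9 GiB", memory.gpu_overhead="1.1 GiB" and layers.offload=0. The Go snippet below is only a rough back-of-the-envelope reading of the numbers printed in that line; it is not Ollama's actual memory.go estimator. It just shows why, after subtracting the requested overhead and the partial compute graph, there is essentially no room left for even a handful of ~60 MiB repeating layers, so the model falls back to the cpu_avx2 runner.

```go
// Rough reading of the "offload to cuda" numbers above; NOT Ollama's real
// memory.go logic, just arithmetic on the values printed in the log.
package main

import "fmt"

const MiB = 1 << 20

func main() {
	available := 1.9 * 1024            // MiB, memory.available for the 940M
	overhead := 1200000000.0 / MiB     // MiB, OLLAMA_GPU_OVERHEAD (~1144 MiB)
	graphPartial := 570.7              // MiB, memory.graph.partial
	perLayerWeights := 1.5 * 1024 / 28 // MiB, repeating weights / block_count
	perLayerKV := 224.0 / 28           // MiB, KV cache / block_count

	left := available - overhead - graphPartial
	fmt.Printf("left after overhead and graph buffer: %.1f MiB\n", left)
	fmt.Printf("cost per offloaded layer: ~%.1f MiB\n", perLayerWeights+perLayerKV)
	// left is ~230 MiB, i.e. at most 3-4 layers before accounting for the
	// non-repeating weights (308.2 MiB) and CUDA context overhead, which is
	// presumably why the estimator settles on 0 layers and CPU-only.
}
```

Under that reading, the CPU-only fallback in this log is the expected outcome of the 1.2 GB overhead setting rather than a separate bug.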

Author
Owner

@kripper commented on GitHub (Nov 17, 2024):

And here with `OLLAMA_GPU_OVERHEAD` = 0.5 GB and an OOM crash:

-sh-5.2$ export OLLAMA_GPU_OVERHEAD=500000000
-sh-5.2$ ollama serve
2024/11/17 18:03:14 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:500000000 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-11-17T18:03:14.143-03:00 level=INFO source=images.go:755 msg="total blobs: 17"
time=2024-11-17T18:03:14.144-03:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
time=2024-11-17T18:03:14.144-03:00 level=INFO source=routes.go:1240 msg="Listening on [::]:11434 (version 0.4.1)"
time=2024-11-17T18:03:14.145-03:00 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama3167376998/runners
time=2024-11-17T18:03:14.365-03:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm cpu]"
time=2024-11-17T18:03:14.365-03:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-17T18:03:14.455-03:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-347193f9-2627-a9eb-8c2e-e2158c820e98 library=cuda variant=v11 compute=5.0 driver=12.5 name="NVIDIA GeForce 940M" total="1.9 GiB" available="1.9 GiB"
time=2024-11-17T18:03:22.816-03:00 level=INFO source=server.go:105 msg="system memory" total="11.1 GiB" free="10.0 GiB" free_swap="8.0 GiB"
time=2024-11-17T18:03:22.817-03:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=6 layers.split="" memory.available="[1.9 GiB]" memory.gpu_overhead="476.8 MiB" memory.required.full="3.2 GiB" memory.required.partial="1.5 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.5 GiB]" memory.weights.total="1.8 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="256.5 MiB" memory.graph.partial="570.7 MiB"
time=2024-11-17T18:03:22.819-03:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3167376998/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 2048 --batch-size 512 --n-gpu-layers 6 --threads 2 --parallel 1 --port 37113"
time=2024-11-17T18:03:22.819-03:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-11-17T18:03:22.819-03:00 level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
time=2024-11-17T18:03:22.819-03:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
time=2024-11-17T18:03:22.832-03:00 level=INFO source=runner.go:863 msg="starting go runner"
time=2024-11-17T18:03:22.832-03:00 level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=2
time=2024-11-17T18:03:22.832-03:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:37113"
llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 28
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:  168 tensors
llama_model_loader: - type q6_K:   29 tensors
time=2024-11-17T18:03:23.070-03:00 level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce 940M, compute capability 5.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors: offloading 6 repeating layers to GPU
llm_load_tensors: offloaded 6/29 layers to GPU
llm_load_tensors:        CPU buffer size =  1918.35 MiB
llm_load_tensors:      CUDA0 buffer size =   358.95 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   176.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    48.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   564.73 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 290
time=2024-11-17T18:03:24.575-03:00 level=INFO source=server.go:601 msg="llama runner started in 1.76 seconds"
CUDA error: out of memory
  current device: 0, in function alloc at ggml-cuda.cu:406
  cuMemCreate(&handle, reserve_size, &prop, 0)
ggml-cuda.cu:132: CUDA error
[New LWP 16269]
[New LWP 16268]
[New LWP 16267]
[New LWP 16266]
[New LWP 16265]
[New LWP 16264]
[New LWP 16263]
[New LWP 16262]
[New LWP 16261]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fad83e2ee13 in wait4 () from /lib64/libc.so.6
#0  0x00007fad83e2ee13 in wait4 () from /lib64/libc.so.6
#1  0x000055f2b80f1639 in ggml_abort ()
#2  0x00007fad85c36a52 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /usr/local/lib/ollama/libggml_cuda_v11.so
#3  0x00007fad85c44f35 in ggml_cuda_pool_vmm::alloc(unsigned long, unsigned long*) () from /usr/local/lib/ollama/libggml_cuda_v11.so
#4  0x00007fad85c398d6 in ggml_cuda_op_mul_mat_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*) () from /usr/local/lib/ollama/libggml_cuda_v11.so
#5  0x00007fad85c3efc9 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.0] () from /usr/local/lib/ollama/libggml_cuda_v11.so
#6  0x00007fad85c4408e in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /usr/local/lib/ollama/libggml_cuda_v11.so
#7  0x000055f2b80da011 in ggml_backend_sched_graph_compute_async ()
#8  0x000055f2b81bc781 in llama_decode ()
#9  0x000055f2b80d1ee1 in _cgo_08d1de4ea234_Cfunc_llama_decode ()
#10 0x000055f2b7ebca61 in ?? ()
#11 0x0000000000000510 in ?? ()
#12 0x000000c000110380 in ?? ()
#13 0x000055f2b7ebaf17 in ?? ()
#14 0x000055f2b7ebf4a5 in ?? ()
#15 0x00007fff3e2b6308 in ?? ()
#16 0x000055f2b7ebf4a5 in ?? ()
#17 0x000055f2b85dd8c0 in ?? ()
#18 0x00007fff3e2b63e0 in ?? ()
#19 0x000055f2b7ebad05 in ?? ()
#20 0x000055f2b7ebac93 in ?? ()
#21 0x00009ef90000000f in ?? ()
#22 0x00007fff3e2b6468 in ?? ()
#23 0x0002fa1900009efe in ?? ()
#24 0x000000000000000f in ?? ()
#25 0x00007fff3e2b6468 in ?? ()
#26 0x00007fad83d55088 in __libc_start_call_main () from /lib64/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
[Inferior 1 (process 16260) detached]
SIGABRT: abort
PC=0x7fad83dc4664 m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 7 gp=0xc000110380 m=0 mp=0x55f2b85dde80 [syscall]:
runtime.cgocall(0x55f2b80d1e90, 0xc000054b60)
        runtime/cgocall.go:157 +0x4b fp=0xc000054b38 sp=0xc000054b00 pc=0x55f2b7e543cb
github.com/ollama/ollama/llama._Cfunc_llama_decode(0x7fad14006a90, {0x200, 0x7fad140dba30, 0x0, 0x0, 0x7fad1430ace0, 0x7fad1430b4f0, 0x7fad1430bd00, 0x7face5999420, 0x0, ...})
        _cgo_gotypes.go:543 +0x52 fp=0xc000054b60 sp=0xc000054b38 pc=0x55f2b7f51952
github.com/ollama/ollama/llama.(*Context).Decode.func1(0x55f2b80cdd4b?, 0x7fad14006a90?)
        github.com/ollama/ollama/llama/llama.go:167 +0xd8 fp=0xc000054c80 sp=0xc000054b60 pc=0x55f2b7f53e78
github.com/ollama/ollama/llama.(*Context).Decode(0x55f2b86c6060?, 0x0?)
        github.com/ollama/ollama/llama/llama.go:167 +0x17 fp=0xc000054cc8 sp=0xc000054c80 pc=0x55f2b7f53cd7
main.(*Server).processBatch(0xc000126120, 0xc000124150, 0xc000046f10)
        github.com/ollama/ollama/llama/runner/runner.go:424 +0x29e fp=0xc000054ed0 sp=0xc000054cc8 pc=0x55f2b80ccd7e
main.(*Server).run(0xc000126120, {0x55f2b840fa40, 0xc00007a050})
        github.com/ollama/ollama/llama/runner/runner.go:338 +0x1a5 fp=0xc000054fb8 sp=0xc000054ed0 pc=0x55f2b80cc765
main.main.gowrap2()
        github.com/ollama/ollama/llama/runner/runner.go:901 +0x28 fp=0xc000054fe0 sp=0xc000054fb8 pc=0x55f2b80d0ec8
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000054fe8 sp=0xc000054fe0 pc=0x55f2b7ebcde1
created by main.main in goroutine 1
        github.com/ollama/ollama/llama/runner/runner.go:901 +0xc2b

goroutine 1 gp=0xc000006380 m=nil [IO wait]:
runtime.gopark(0xc000032008?, 0x0?, 0x80?, 0x63?, 0xc00002b8c0?)
        runtime/proc.go:402 +0xce fp=0xc00002b888 sp=0xc00002b868 pc=0x55f2b7e8b00e
runtime.netpollblock(0xc00002b920?, 0xb7e53b26?, 0xf2?)
        runtime/netpoll.go:573 +0xf7 fp=0xc00002b8c0 sp=0xc00002b888 pc=0x55f2b7e83257
internal/poll.runtime_pollWait(0x7fad85b47ff0, 0x72)
        runtime/netpoll.go:345 +0x85 fp=0xc00002b8e0 sp=0xc00002b8c0 pc=0x55f2b7eb7aa5
internal/poll.(*pollDesc).wait(0x3?, 0x3fe?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00002b908 sp=0xc00002b8e0 pc=0x55f2b7f079c7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000156080)
        internal/poll/fd_unix.go:611 +0x2ac fp=0xc00002b9b0 sp=0xc00002b908 pc=0x55f2b7f08e8c
net.(*netFD).accept(0xc000156080)
        net/fd_unix.go:172 +0x29 fp=0xc00002ba68 sp=0xc00002b9b0 pc=0x55f2b7f778a9
net.(*TCPListener).accept(0xc0000301e0)
        net/tcpsock_posix.go:159 +0x1e fp=0xc00002ba90 sp=0xc00002ba68 pc=0x55f2b7f885de
net.(*TCPListener).Accept(0xc0000301e0)
        net/tcpsock.go:327 +0x30 fp=0xc00002bac0 sp=0xc00002ba90 pc=0x55f2b7f87930
net/http.(*onceCloseListener).Accept(0xc0001261b0?)
        <autogenerated>:1 +0x24 fp=0xc00002bad8 sp=0xc00002bac0 pc=0x55f2b80aea44
net/http.(*Server).Serve(0xc0000161e0, {0x55f2b840f400, 0xc0000301e0})
        net/http/server.go:3260 +0x33e fp=0xc00002bc08 sp=0xc00002bad8 pc=0x55f2b80a585e
main.main()
        github.com/ollama/ollama/llama/runner/runner.go:921 +0xfcc fp=0xc00002bf50 sp=0xc00002bc08 pc=0x55f2b80d0c4c
runtime.main()
        runtime/proc.go:271 +0x29d fp=0xc00002bfe0 sp=0xc00002bf50 pc=0x55f2b7e8abdd
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc00002bfe8 sp=0xc00002bfe0 pc=0x55f2b7ebcde1

goroutine 2 gp=0xc000006e00 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:402 +0xce fp=0xc000044fa8 sp=0xc000044f88 pc=0x55f2b7e8b00e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.forcegchelper()
        runtime/proc.go:326 +0xb8 fp=0xc000044fe0 sp=0xc000044fa8 pc=0x55f2b7e8ae98
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000044fe8 sp=0xc000044fe0 pc=0x55f2b7ebcde1
created by runtime.init.6 in goroutine 1
        runtime/proc.go:314 +0x1a

goroutine 3 gp=0xc000007340 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:402 +0xce fp=0xc000045780 sp=0xc000045760 pc=0x55f2b7e8b00e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.bgsweep(0xc00006c000)
        runtime/mgcsweep.go:278 +0x94 fp=0xc0000457c8 sp=0xc000045780 pc=0x55f2b7e75b54
runtime.gcenable.gowrap1()
        runtime/mgc.go:203 +0x25 fp=0xc0000457e0 sp=0xc0000457c8 pc=0x55f2b7e6a685
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000457e8 sp=0xc0000457e0 pc=0x55f2b7ebcde1
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:203 +0x66

goroutine 4 gp=0xc000007500 m=nil [GC scavenge wait]:
runtime.gopark(0xc00006c000?, 0x55f2b830fe98?, 0x1?, 0x0?, 0xc000007500?)
        runtime/proc.go:402 +0xce fp=0xc000045f78 sp=0xc000045f58 pc=0x55f2b7e8b00e
runtime.goparkunlock(...)
        runtime/proc.go:408
runtime.(*scavengerState).park(0x55f2b85dd4c0)
        runtime/mgcscavenge.go:425 +0x49 fp=0xc000045fa8 sp=0xc000045f78 pc=0x55f2b7e73549
runtime.bgscavenge(0xc00006c000)
        runtime/mgcscavenge.go:653 +0x3c fp=0xc000045fc8 sp=0xc000045fa8 pc=0x55f2b7e73adc
runtime.gcenable.gowrap2()
        runtime/mgc.go:204 +0x25 fp=0xc000045fe0 sp=0xc000045fc8 pc=0x55f2b7e6a625
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000045fe8 sp=0xc000045fe0 pc=0x55f2b7ebcde1
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:204 +0xa5

goroutine 5 gp=0xc000110000 m=nil [finalizer wait]:
runtime.gopark(0xc000044648?, 0x55f2b7e5df85?, 0xa8?, 0x1?, 0xc000006380?)
        runtime/proc.go:402 +0xce fp=0xc000044620 sp=0xc000044600 pc=0x55f2b7e8b00e
runtime.runfinq()
        runtime/mfinal.go:194 +0x107 fp=0xc0000447e0 sp=0xc000044620 pc=0x55f2b7e696c7
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0000447e8 sp=0xc0000447e0 pc=0x55f2b7ebcde1
created by runtime.createfing in goroutine 1
        runtime/mfinal.go:164 +0x3d

goroutine 20 gp=0xc0001101c0 m=nil [IO wait]:
runtime.gopark(0x10?, 0x10?, 0xf0?, 0x7d?, 0xb?)
        runtime/proc.go:402 +0xce fp=0xc000047da8 sp=0xc000047d88 pc=0x55f2b7e8b00e
runtime.netpollblock(0x55f2b7ef1558?, 0xb7e53b26?, 0xf2?)
        runtime/netpoll.go:573 +0xf7 fp=0xc000047de0 sp=0xc000047da8 pc=0x55f2b7e83257
internal/poll.runtime_pollWait(0x7fad85b47ef8, 0x72)
        runtime/netpoll.go:345 +0x85 fp=0xc000047e00 sp=0xc000047de0 pc=0x55f2b7eb7aa5
internal/poll.(*pollDesc).wait(0xc000156100?, 0xc000108ee1?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000047e28 sp=0xc000047e00 pc=0x55f2b7f079c7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000156100, {0xc000108ee1, 0x1, 0x1})
        internal/poll/fd_unix.go:164 +0x27a fp=0xc000047ec0 sp=0xc000047e28 pc=0x55f2b7f0851a
net.(*netFD).Read(0xc000156100, {0xc000108ee1?, 0xc000047f48?, 0x55f2b7eb96d0?})
        net/fd_posix.go:55 +0x25 fp=0xc000047f08 sp=0xc000047ec0 pc=0x55f2b7f767a5
net.(*conn).Read(0xc0000480a0, {0xc000108ee1?, 0x0?, 0x55f2b86c6060?})
        net/net.go:185 +0x45 fp=0xc000047f50 sp=0xc000047f08 pc=0x55f2b7f80a65
net.(*TCPConn).Read(0x55f2b859e840?, {0xc000108ee1?, 0x0?, 0x0?})
        <autogenerated>:1 +0x25 fp=0xc000047f80 sp=0xc000047f50 pc=0x55f2b7f8c445
net/http.(*connReader).backgroundRead(0xc000108ed0)
        net/http/server.go:681 +0x37 fp=0xc000047fc8 sp=0xc000047f80 pc=0x55f2b809b1d7
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:677 +0x25 fp=0xc000047fe0 sp=0xc000047fc8 pc=0x55f2b809b105
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc000047fe8 sp=0xc000047fe0 pc=0x55f2b7ebcde1
created by net/http.(*connReader).startBackgroundRead in goroutine 8
        net/http/server.go:677 +0xba

goroutine 8 gp=0xc000110540 m=nil [select]:
runtime.gopark(0xc0001cda80?, 0x2?, 0x60?, 0x0?, 0xc0001cd824?)
        runtime/proc.go:402 +0xce fp=0xc0001cd698 sp=0xc0001cd678 pc=0x55f2b7e8b00e
runtime.selectgo(0xc0001cda80, 0xc0001cd820, 0x7df?, 0x0, 0x1?, 0x1)
        runtime/select.go:327 +0x725 fp=0xc0001cd7b8 sp=0xc0001cd698 pc=0x55f2b7e9c3e5
main.(*Server).completion(0xc000126120, {0x55f2b840f5b0, 0xc0001961c0}, 0xc000184240)
        github.com/ollama/ollama/llama/runner/runner.go:652 +0x8fe fp=0xc0001cdab8 sp=0xc0001cd7b8 pc=0x55f2b80ce6de
main.(*Server).completion-fm({0x55f2b840f5b0?, 0xc0001961c0?}, 0x55f2b80a9b8d?)
        <autogenerated>:1 +0x36 fp=0xc0001cdae8 sp=0xc0001cdab8 pc=0x55f2b80d16b6
net/http.HandlerFunc.ServeHTTP(0xc00010add0?, {0x55f2b840f5b0?, 0xc0001961c0?}, 0x10?)
        net/http/server.go:2171 +0x29 fp=0xc0001cdb10 sp=0xc0001cdae8 pc=0x55f2b80a2629
net/http.(*ServeMux).ServeHTTP(0x55f2b7e5df85?, {0x55f2b840f5b0, 0xc0001961c0}, 0xc000184240)
        net/http/server.go:2688 +0x1ad fp=0xc0001cdb60 sp=0xc0001cdb10 pc=0x55f2b80a44ad
net/http.serverHandler.ServeHTTP({0x55f2b840e900?}, {0x55f2b840f5b0?, 0xc0001961c0?}, 0x6?)
        net/http/server.go:3142 +0x8e fp=0xc0001cdb90 sp=0xc0001cdb60 pc=0x55f2b80a54ce
net/http.(*conn).serve(0xc0001261b0, {0x55f2b840fa08, 0xc000108db0})
        net/http/server.go:2044 +0x5e8 fp=0xc0001cdfb8 sp=0xc0001cdb90 pc=0x55f2b80a1268
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3290 +0x28 fp=0xc0001cdfe0 sp=0xc0001cdfb8 pc=0x55f2b80a5c48
runtime.goexit({})
        runtime/asm_amd64.s:1695 +0x1 fp=0xc0001cdfe8 sp=0xc0001cdfe0 pc=0x55f2b7ebcde1
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3290 +0x4b4

rax    0x0
rbx    0x3f84
rcx    0x7fad83dc4664
rdx    0x6
rdi    0x3f84
rsi    0x3f84
rbp    0x7fff3e2b4ef0
rsp    0x7fff3e2b4eb0
r8     0x0
r9     0xfffffffb
r10    0x8
r11    0x246
r12    0x7fadc009b000
r13    0x84
r14    0x6
r15    0x61b920000
rip    0x7fad83dc4664
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
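
A note on this crash, under an assumed interpretation of the backtrace (not a definitive analysis): the estimator accepted 6 offloaded layers and the load itself succeeded ("llama runner started in 1.76 seconds"), but the abort happens later inside llama_decode, when ggml_cuda_op_mul_mat_cublas asks ggml_cuda_pool_vmm::alloc for temporary scratch space and the cuMemCreate at ggml-cuda.cu:406 fails. The Go snippet below just sums the CUDA0 allocations the runner reported, to show how little headroom that pool has to work with on a 1.9 GiB card.

```go
// Back-of-the-envelope sum of the CUDA0 buffers reported in the log above;
// an assumed interpretation of the OOM, not a definitive analysis.
package main

import "fmt"

func main() {
	const availableMiB = 1.9 * 1024 // memory.available for the GeForce 940M

	weights := 358.95 // MiB, "CUDA0 buffer size" (6 offloaded layers)
	kv := 48.00       // MiB, "CUDA0 KV buffer size"
	compute := 564.73 // MiB, "CUDA0 compute buffer size"

	used := weights + kv + compute
	fmt.Printf("static CUDA0 allocations: %.1f MiB\n", used)      // ~971.7
	fmt.Printf("nominal headroom: %.1f MiB\n", availableMiB-used) // ~973.9
	// Out of that nominal headroom, the CUDA context, whatever the desktop is
	// already holding, and the dequantize-to-F16 scratch that the cuBLAS
	// mul_mat path takes from the VMM pool during decode all have to fit; on
	// this card the cuMemCreate for that pool is what fails with OOM.
}
```

If that reading is right, the workable range on this GPU is narrow: either keep the overhead high enough that everything stays on the CPU (slow but stable, as in the previous run), or reduce the GPU footprint further, for example with a smaller context or by forcing fewer offloaded layers (the num_gpu option, if available in this build), and check with nvidia-smi how much VRAM the desktop already occupies before the model loads.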

internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc000156100, {0xc000108ee1, 0x1, 0x1}) internal/poll/fd_unix.go:164 +0x27a fp=0xc000047ec0 sp=0xc000047e28 pc=0x55f2b7f0851a net.(*netFD).Read(0xc000156100, {0xc000108ee1?, 0xc000047f48?, 0x55f2b7eb96d0?}) net/fd_posix.go:55 +0x25 fp=0xc000047f08 sp=0xc000047ec0 pc=0x55f2b7f767a5 net.(*conn).Read(0xc0000480a0, {0xc000108ee1?, 0x0?, 0x55f2b86c6060?}) net/net.go:185 +0x45 fp=0xc000047f50 sp=0xc000047f08 pc=0x55f2b7f80a65 net.(*TCPConn).Read(0x55f2b859e840?, {0xc000108ee1?, 0x0?, 0x0?}) <autogenerated>:1 +0x25 fp=0xc000047f80 sp=0xc000047f50 pc=0x55f2b7f8c445 net/http.(*connReader).backgroundRead(0xc000108ed0) net/http/server.go:681 +0x37 fp=0xc000047fc8 sp=0xc000047f80 pc=0x55f2b809b1d7 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:677 +0x25 fp=0xc000047fe0 sp=0xc000047fc8 pc=0x55f2b809b105 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc000047fe8 sp=0xc000047fe0 pc=0x55f2b7ebcde1 created by net/http.(*connReader).startBackgroundRead in goroutine 8 net/http/server.go:677 +0xba goroutine 8 gp=0xc000110540 m=nil [select]: runtime.gopark(0xc0001cda80?, 0x2?, 0x60?, 0x0?, 0xc0001cd824?) runtime/proc.go:402 +0xce fp=0xc0001cd698 sp=0xc0001cd678 pc=0x55f2b7e8b00e runtime.selectgo(0xc0001cda80, 0xc0001cd820, 0x7df?, 0x0, 0x1?, 0x1) runtime/select.go:327 +0x725 fp=0xc0001cd7b8 sp=0xc0001cd698 pc=0x55f2b7e9c3e5 main.(*Server).completion(0xc000126120, {0x55f2b840f5b0, 0xc0001961c0}, 0xc000184240) github.com/ollama/ollama/llama/runner/runner.go:652 +0x8fe fp=0xc0001cdab8 sp=0xc0001cd7b8 pc=0x55f2b80ce6de main.(*Server).completion-fm({0x55f2b840f5b0?, 0xc0001961c0?}, 0x55f2b80a9b8d?) <autogenerated>:1 +0x36 fp=0xc0001cdae8 sp=0xc0001cdab8 pc=0x55f2b80d16b6 net/http.HandlerFunc.ServeHTTP(0xc00010add0?, {0x55f2b840f5b0?, 0xc0001961c0?}, 0x10?) net/http/server.go:2171 +0x29 fp=0xc0001cdb10 sp=0xc0001cdae8 pc=0x55f2b80a2629 net/http.(*ServeMux).ServeHTTP(0x55f2b7e5df85?, {0x55f2b840f5b0, 0xc0001961c0}, 0xc000184240) net/http/server.go:2688 +0x1ad fp=0xc0001cdb60 sp=0xc0001cdb10 pc=0x55f2b80a44ad net/http.serverHandler.ServeHTTP({0x55f2b840e900?}, {0x55f2b840f5b0?, 0xc0001961c0?}, 0x6?) net/http/server.go:3142 +0x8e fp=0xc0001cdb90 sp=0xc0001cdb60 pc=0x55f2b80a54ce net/http.(*conn).serve(0xc0001261b0, {0x55f2b840fa08, 0xc000108db0}) net/http/server.go:2044 +0x5e8 fp=0xc0001cdfb8 sp=0xc0001cdb90 pc=0x55f2b80a1268 net/http.(*Server).Serve.gowrap3() net/http/server.go:3290 +0x28 fp=0xc0001cdfe0 sp=0xc0001cdfb8 pc=0x55f2b80a5c48 runtime.goexit({}) runtime/asm_amd64.s:1695 +0x1 fp=0xc0001cdfe8 sp=0xc0001cdfe0 pc=0x55f2b7ebcde1 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3290 +0x4b4 rax 0x0 rbx 0x3f84 rcx 0x7fad83dc4664 rdx 0x6 rdi 0x3f84 rsi 0x3f84 rbp 0x7fff3e2b4ef0 rsp 0x7fff3e2b4eb0 r8 0x0 r9 0xfffffffb r10 0x8 r11 0x246 r12 0x7fadc009b000 r13 0x84 r14 0x6 r15 0x61b920000 rip 0x7fad83dc4664 rflags 0x246 cs 0x33 fs 0x0 gs 0x0 ```

@rick-github commented on GitHub (Nov 17, 2024):

`OLLAMA_GPU_OVERHEAD=1200000000` being slow is expected: the GPU has 1.9G free, the overhead removes 1.1G, leaving 800M, which is not enough to hold the KV cache, the compute graph and a layer, so the whole lot gets moved to RAM.

`OLLAMA_GPU_OVERHEAD=500000000` OOMing is unexpected, because ollama only loads 6 layers and there is supposed to be unused VRAM left over that llama.cpp can allocate from. Is there a value of `num_gpu` at which the model loads and doesn't OOM when being used?
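
As a rough sanity check of that arithmetic, here is a small sketch using the figures from the log above (1.9 GiB free VRAM, 224 MiB KV cache, 570.7 MiB partial compute graph); the constants are illustrative, not values read from ollama itself:

```go
// Sketch of the VRAM budget described in the comment above. All figures are
// taken from the log in this issue; nothing here is queried from ollama.
package main

import "fmt"

func main() {
	const mib = 1024.0 * 1024.0

	freeVRAM := 1.9 * 1024 * mib // reported free VRAM
	kv := 224.0 * mib            // memory.required.kv from the log
	graphPartial := 570.7 * mib  // memory.graph.partial from the log

	for _, overhead := range []float64{1_200_000_000, 500_000_000} {
		left := freeVRAM - overhead - kv - graphPartial
		fmt.Printf("OLLAMA_GPU_OVERHEAD=%.0f leaves %.0f MiB for layer weights\n",
			overhead, left/mib)
	}
}
```

On those numbers the 1.2 GB overhead leaves essentially nothing for layer weights, while 0.5 GB should still leave several hundred MiB, which is why the OOM is surprising.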


@kripper commented on GitHub (Nov 17, 2024):

Using lm-studio with the same hardware, model and prompt, it works fine: it uses the GPU with 1646 MiB / 2048 MiB of VRAM, some CPU, and about 4 GB of resident RAM.


@rick-github commented on GitHub (Nov 17, 2024):

How many layers does lm-studio load in VRAM?


@rick-github commented on GitHub (Nov 17, 2024):

I started an LM-Studio server and the performance/memory for full GPU offload is the same as ollama:

{"model":"lmstudio-community/llama-3.2-3b-instruct","tps":138.44,"memory_usage":["151 MiB","2586 MiB"]}
{"model":"llama3.2:3b","tps":140.47,"memory_usage":["2598 MiB"]}

I'll try setting up an environment similar to yours and see what happens.


@kripper commented on GitHub (Nov 18, 2024):

Sorry, I didn't notice LM-Studio was using the Vulkan Runtime instead of CUDA Runtime.
With Vulkan it worked fine with GPU Offload = 4/28 and sometimes with GPU Offload = 5/28.
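
For comparison, the equivalent knob in ollama is `num_gpu`, which can be set per request; a minimal sketch, assuming a local ollama server on the default port, with the model name and prompt as placeholders:

```go
// Sketch of pinning ollama to a small offload count (4 layers, mirroring the
// LM-Studio 4/28 setting above) via the REST API. Model and prompt are
// placeholders, not values from this issue.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model":  "llama3.2:3b",
		"prompt": "Why is the sky blue?",
		"stream": false,
		"options": map[string]any{
			"num_gpu": 4, // number of layers to offload to the GPU
		},
	})

	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```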


@kripper commented on GitHub (Dec 4, 2024):

I'm closing this issue.
Accepted answer: https://github.com/ollama/ollama/issues/7673#issuecomment-2480813021

Reference: github-starred/ollama#66953