[GH-ISSUE #2835] CUDA out of memory error on Windows when `ollama run` starts up #63763

Closed
opened 2026-05-03 14:54:00 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @boluny on GitHub (Feb 29, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2835

Originally assigned to: @dhiltgen on GitHub.

Hi there,

I just installed ollama 0.1.27 and tried to run gemma:2b, but it reports a CUDA out of memory error. Could you please investigate and find the root cause?

I'm using an i7-4700HQ CPU with 16 GB RAM.

Attached are the log and the nvidia-smi report:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.41 Driver Version: 531.41 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 960M WDDM | 00000000:02:00.0 Off | N/A |
| N/A 0C P0 N/A / N/A| 181MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 272 C+G ...s (x86)\Mozilla Firefox\firefox.exe N/A |
| 0 N/A N/A 4520 C+G ....0_x64__8wekyb3d8bbwe\YourPhone.exe N/A |
| 0 N/A N/A 7580 C+G ....Experiences.TextInput.InputApp.exe N/A |
| 0 N/A N/A 9940 C+G ...2txyewy\StartMenuExperienceHost.exe N/A |
| 0 N/A N/A 11012 C+G ...t.LockApp_cw5n1h2txyewy\LockApp.exe N/A |
| 0 N/A N/A 12428 C+G ...cal\Microsoft\OneDrive\OneDrive.exe N/A |
| 0 N/A N/A 13100 C+G ...s (x86)\Mozilla Firefox\firefox.exe N/A |
| 0 N/A N/A 13332 C+G ...guoyun\bin-7.1.3\NutstoreClient.exe N/A |
+---------------------------------------------------------------------------------------+

log:

[GIN] 2024/02/29 - 23:47:32 | 200 | 32.7µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/02/29 - 23:47:32 | 200 | 1.2447ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/02/29 - 23:47:32 | 200 | 2.4218ms | 127.0.0.1 | POST "/api/show"
time=2024-02-29T23:47:37.171+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-29T23:47:37.171+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-02-29T23:47:37.216+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll C:\WINDOWS\system32\nvml.dll]"
time=2024-02-29T23:47:37.236+08:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-02-29T23:47:37.236+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-29T23:47:37.248+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.0"
time=2024-02-29T23:47:37.248+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-29T23:47:37.252+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.0"
time=2024-02-29T23:47:37.253+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-29T23:47:37.253+08:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to
time=2024-02-29T23:47:37.328+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\bolun\AppData\Local\Temp\ollama625311207\cuda_v11.3\ext_server.dll"
time=2024-02-29T23:47:37.329+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 960M, compute capability 5.0, VMM: yes
llama_model_loader: loaded meta data with 21 key-value pairs and 164 tensors from C:\Users\bolun.ollama\models\blobs\sha256-c1864a5eb19305c40519da12cc543519e48a0697ecd30e15d5ac228644957d12 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma
llama_model_loader: - kv 1: general.name str = gemma-2b-it
llama_model_loader: - kv 2: gemma.context_length u32 = 8192
llama_model_loader: - kv 3: gemma.block_count u32 = 18
llama_model_loader: - kv 4: gemma.embedding_length u32 = 2048
llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 16384
llama_model_loader: - kv 6: gemma.attention.head_count u32 = 8
llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 1
llama_model_loader: - kv 8: gemma.attention.key_length u32 = 256
llama_model_loader: - kv 9: gemma.attention.value_length u32 = 256
llama_model_loader: - kv 10: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 14: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 15: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,256128] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,256128] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,256128] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - kv 20: general.file_type u32 = 2
llama_model_loader: - type f32: 37 tensors
llama_model_loader: - type q4_0: 126 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256128
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 8
llm_load_print_meta: n_head_kv = 1
llm_load_print_meta: n_layer = 18
llm_load_print_meta: n_rot = 256
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 16384
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 2.51 B
llm_load_print_meta: model size = 1.56 GiB (5.34 BPW)
llm_load_print_meta: general.name = gemma-2b-it
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.13 MiB
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors: CPU buffer size = 531.52 MiB
llm_load_tensors: CUDA0 buffer size = 1594.93 MiB
.....................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 36.00 MiB
llama_new_context_with_model: KV self size = 36.00 MiB, K (f16): 18.00 MiB, V (f16): 18.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 9.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 504.25 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 4.00 MiB
llama_new_context_with_model: graph splits (measure): 3
CUDA error: out of memory
current device: 0, in function ggml_cuda_pool_malloc_vmm at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:7990
cuMemSetAccess(g_cuda_pool_addr[device] + g_cuda_pool_size[device], reserve_size, &access, 1)
GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"
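For context, a rough tally of the GPU allocations reported in this log (a back-of-the-envelope sketch, not ollama's internal accounting) suggests the statically allocated buffers fit comfortably within the 960M's 4 GiB, so the failure appears to come from the later VMM pool reservation (`cuMemSetAccess`) rather than the initial buffers:

```python
# Rough VRAM accounting from the log above (all values in MiB).
# Illustrative arithmetic only; not ollama's internal bookkeeping.
model_weights = 1594.93   # llm_load_tensors: CUDA0 buffer size
kv_cache      = 36.00     # llama_kv_cache_init: CUDA0 KV buffer size
compute_buf   = 504.25    # llama_new_context_with_model: CUDA0 compute buffer size
already_used  = 181.0     # nvidia-smi: 181MiB in use before the run
total_vram    = 4096.0    # GTX 960M

committed = model_weights + kv_cache + compute_buf + already_used
print(f"committed: {committed:.2f} MiB of {total_vram:.0f} MiB")
print(f"headroom:  {total_vram - committed:.2f} MiB")
```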

GiteaMirror added the bug label 2026-05-03 14:54:00 -05:00
Author
Owner

@pdevine commented on GitHub (Mar 1, 2024):

cc @dhiltgen

Author
Owner

@hbqclh commented on GitHub (Mar 2, 2024):

I am also experiencing the same error. Here is the error log:
time=2024-03-02T10:57:13.946+08:00 level=INFO source=images.go:710 msg="total blobs: 17"
time=2024-03-02T10:57:13.959+08:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-02T10:57:13.961+08:00 level=INFO source=routes.go:1019 msg="Listening on [::]:1123 (version 0.1.27)"
time=2024-03-02T10:57:13.961+08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-02T10:57:14.141+08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx2 cpu cuda_v11.3 cpu_avx]"
[GIN] 2024/03/02 - 10:57:14 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/02 - 10:57:14 | 200 | 2.3633ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/02 - 10:57:14 | 200 | 2.3207ms | 127.0.0.1 | POST "/api/show"
time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-02T10:57:14.909+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]"
time=2024-03-02T10:57:14.926+08:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-02T10:57:14.926+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\huan\AppData\Local\Microsoft\WindowsApps;;C:\Users\huan\AppData\Local\Programs\Ollama"
time=2024-03-02T10:57:15.046+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3\ext_server.dll"
time=2024-03-02T10:57:15.047+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from C:\Users\huan.ollama\models\blobs\sha256-8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 3577.56 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
time=2024-03-02T10:57:27.527+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
[GIN] 2024/03/02 - 10:57:27 | 200 | 13.1901436s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/03/02 - 10:57:42 | 200 | 453.2522ms | 127.0.0.1 | POST "/api/chat"
time=2024-03-02T10:57:56.201+08:00 level=INFO source=routes.go:78 msg="changing loaded model"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3\ext_server.dll"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from C:\Users\huan.ollama\models\blobs\sha256-456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma
llama_model_loader: - kv 1: general.name str = gemma-7b-it
llama_model_loader: - kv 2: gemma.context_length u32 = 8192
llama_model_loader: - kv 3: gemma.embedding_length u32 = 3072
llama_model_loader: - kv 4: gemma.block_count u32 = 28
llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 24576
llama_model_loader: - kv 6: gemma.attention.head_count u32 = 16
llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256
llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: general.file_type u32 = 2
llama_model_loader: - type f32: 57 tensors
llama_model_loader: - type q4_0: 196 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_rot = 192
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 24576
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.54 B
llm_load_print_meta: model size = 4.84 GiB (4.87 BPW)
llm_load_print_meta: general.name = gemma-7b-it
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.19 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU buffer size = 796.88 MiB
llm_load_tensors: CUDA0 buffer size = 4955.54 MiB
...........................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 11.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 506.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 6.00 MiB
llama_new_context_with_model: graph splits (measure): 3
time=2024-03-02T10:58:13.011+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
CUDA error: out of memory
current device: 0, in function ggml_cuda_pool_malloc_vmm at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:7990
cuMemSetAccess(g_cuda_pool_addr[device] + g_cuda_pool_size[device], reserve_size, &access, 1)
GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"
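Worth noting: this failure occurs right after a model swap ("changing loaded model"). A quick sketch using the buffer sizes from the log (illustrative arithmetic only, assuming the 3070 Ti's 8 GiB of VRAM) suggests gemma:7b fits on its own, but not if the previous llama2 buffers were still resident when it loaded:

```python
# Per-model GPU footprints from the log above: weights + KV cache +
# compute buffer, in MiB. Illustrative arithmetic only, not a claim
# about how ollama actually manages memory across model swaps.
llama2  = 3577.56 + 1024.00 + 164.01   # first model loaded
gemma7b = 4955.54 + 896.00 + 506.00    # second model loaded after the swap
vram    = 8 * 1024.0                   # RTX 3070 Ti

print(f"gemma:7b alone: {gemma7b:.2f} MiB -> fits: {gemma7b < vram}")
print(f"both resident:  {llama2 + gemma7b:.2f} MiB -> fits: {llama2 + gemma7b < vram}")
```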

@hbqclh commented on GitHub (Mar 2, 2024):

I am also experiencing the same error. Here is the error log:

time=2024-03-02T10:57:13.946+08:00 level=INFO source=images.go:710 msg="total blobs: 17"
time=2024-03-02T10:57:13.959+08:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-02T10:57:13.961+08:00 level=INFO source=routes.go:1019 msg="Listening on [::]:1123 (version 0.1.27)"
time=2024-03-02T10:57:13.961+08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-02T10:57:14.141+08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx2 cpu cuda_v11.3 cpu_avx]"
[GIN] 2024/03/02 - 10:57:14 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/02 - 10:57:14 | 200 | 2.3633ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/02 - 10:57:14 | 200 | 2.3207ms | 127.0.0.1 | POST "/api/show"
time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-02T10:57:14.909+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\\Windows\\System32\\nvml.dll C:\\Windows\\system32\\nvml.dll]"
time=2024-03-02T10:57:14.926+08:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-02T10:57:14.926+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\\Users\\huan\\AppData\\Local\\Temp\\ollama2241795987\\cuda_v11.3;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;;;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.3.1\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\NVIDIA Corporation\\NVIDIA NvDLISR;C:\\Users\\huan\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Users\\huan\\AppData\\Local\\Programs\\Ollama"
time=2024-03-02T10:57:15.046+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\\Users\\huan\\AppData\\Local\\Temp\\ollama2241795987\\cuda_v11.3\\ext_server.dll"
time=2024-03-02T10:57:15.047+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from C:\Users\huan\.ollama\models\blobs\sha256-8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 3577.56 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
time=2024-03-02T10:57:27.527+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
[GIN] 2024/03/02 - 10:57:27 | 200 | 13.1901436s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/03/02 - 10:57:42 | 200 | 453.2522ms | 127.0.0.1 | POST "/api/chat"
time=2024-03-02T10:57:56.201+08:00 level=INFO source=routes.go:78 msg="changing loaded model"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\\Users\\huan\\AppData\\Local\\Temp\\ollama2241795987\\cuda_v11.3\\ext_server.dll"
time=2024-03-02T10:57:58.660+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from C:\Users\huan\.ollama\models\blobs\sha256-456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma
llama_model_loader: - kv 1: general.name str = gemma-7b-it
llama_model_loader: - kv 2: gemma.context_length u32 = 8192
llama_model_loader: - kv 3: gemma.embedding_length u32 = 3072
llama_model_loader: - kv 4: gemma.block_count u32 = 28
llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 24576
llama_model_loader: - kv 6: gemma.attention.head_count u32 = 16
llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256
llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: general.file_type u32 = 2
llama_model_loader: - type f32: 57 tensors
llama_model_loader: - type q4_0: 196 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_rot = 192
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 24576
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.54 B
llm_load_print_meta: model size = 4.84 GiB (4.87 BPW)
llm_load_print_meta: general.name = gemma-7b-it
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.19 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU buffer size = 796.88 MiB
llm_load_tensors: CUDA0 buffer size = 4955.54 MiB
...........................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 11.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 506.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 6.00 MiB
llama_new_context_with_model: graph splits (measure): 3
time=2024-03-02T10:58:13.011+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
CUDA error: out of memory
current device: 0, in function ggml_cuda_pool_malloc_vmm at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:7990
cuMemSetAccess(g_cuda_pool_addr[device] + g_cuda_pool_size[device], reserve_size, &access, 1)
GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"

@trandhbao commented on GitHub (Mar 2, 2024):

My spec: Ubuntu 22.04, 16.0 GiB RAM, GeForce 940MX (2048 MiB).

Using gemma:2b, I face the same CUDA OOM error when calling the Ollama API from my web app. I do NOT face the same error via `ollama run`.

So I got Ollama working with Gemma from the web app by:

  • putting NO user instruction in the system message of the API payload (or perhaps removing the system message entirely; I have not tried that)
  • starting the conversation with "Hello"; any longer question like "Why is the sky blue" would not work
  • after that, I can ask questions of any length

Strange, but it worked for me. Note that I do not face a similar issue with other LLMs like Mistral.

BTW this is my first ever Github comment. Many many thanks to the great Ollama team!
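For anyone trying to reproduce this workaround, the steps above amount to a `/api/chat` payload like the following. This is only a sketch of the commenter's workaround, assuming Ollama's standard chat endpoint on the default port (`http://localhost:11434`); the model name and the "Hello" opener are taken from the comment.

```python
import json

# Sketch of the workaround described above: no "system" message at all,
# and the first user turn is a short greeting rather than a long question.
payload = {
    "model": "gemma:2b",
    "messages": [
        {"role": "user", "content": "Hello"},  # short opener first
    ],
    "stream": False,
}

def has_system_message(p: dict) -> bool:
    """Return True if any message in the payload uses the 'system' role."""
    return any(m["role"] == "system" for m in p["messages"])

# The payload can be sent with any HTTP client, e.g.:
#   curl http://localhost:11434/api/chat -d @payload.json
print(json.dumps(payload, indent=2))
```

Per the comment, once that first short exchange succeeds, longer questions can follow as additional `messages` turns in the same conversation.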


@ruca-radio commented on GitHub (Mar 3, 2024):

Similar issue here, with a Ryzen 5700X, 32 GB RAM, and dual GPUs:

time=2024-03-02T23:09:06.654-05:00 level=INFO source=images.go:710 msg="total blobs: 0"
time=2024-03-02T23:09:06.655-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-02T23:09:06.655-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-03-02T23:09:06.655-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-02T23:09:06.811-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cuda_v11.3 cpu_avx cpu_avx2 cpu]"
[GIN] 2024/03/02 - 23:09:23 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/02 - 23:09:23 | 404 | 528.3µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/02 - 23:09:24 | 200 | 492.9968ms | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/03/02 - 23:09:27 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/02 - 23:09:27 | 404 | 0s | 127.0.0.1 | POST "/api/show"
time=2024-03-02T23:09:28.622-05:00 level=INFO source=download.go:136 msg="downloading e8a35b5937a5 in 42 100 MB part(s)"
time=2024-03-02T23:10:36.482-05:00 level=INFO source=download.go:136 msg="downloading 43070e2d4e53 in 1 11 KB part(s)"
time=2024-03-02T23:10:38.340-05:00 level=INFO source=download.go:136 msg="downloading e6836092461f in 1 42 B part(s)"
time=2024-03-02T23:10:41.345-05:00 level=INFO source=download.go:136 msg="downloading ed11eda7790d in 1 30 B part(s)"
time=2024-03-02T23:10:43.244-05:00 level=INFO source=download.go:136 msg="downloading f9b1e3196ecf in 1 483 B part(s)"
[GIN] 2024/03/02 - 23:10:47 | 200 | 1m20s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/03/02 - 23:10:47 | 200 | 524.1µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/02 - 23:10:47 | 200 | 528.6µs | 127.0.0.1 | POST "/api/show"
time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-02T23:10:47.840-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]"
time=2024-03-02T23:10:47.855-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-02T23:10:47.859-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA1\AppData\Local\Temp\ollama991450673\cuda_v11.3;C:\Users\rucaradio\AppData\Local\Programs\Ollama;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\ "
time=2024-03-02T23:10:48.341-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA1\AppData\Local\Temp\ollama991450673\cuda_v11.3\ext_server.dll"
time=2024-03-02T23:10:48.342-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: Quadro M6000, compute capability 5.2, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio\.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name = mistralai
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 1989.53 MiB
llm_load_tensors: CUDA1 buffer size = 1858.02 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 5
CUDA error: unspecified launch failure
current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953
cudaDeviceSynchronize()
GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"
time=2024-03-03T01:54:03.879-05:00 level=INFO source=images.go:710 msg="total blobs: 5"
time=2024-03-03T01:54:03.884-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-03T01:54:03.885-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-03-03T01:54:03.885-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-03T01:54:04.032-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11.3 cpu]"
[GIN] 2024/03/03 - 01:54:04 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/03 - 01:54:04 | 200 | 14.8896ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/03 - 01:54:04 | 200 | 505.4µs | 127.0.0.1 | POST "/api/show"
time=2024-03-03T01:54:04.874-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-03T01:54:04.874-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-03T01:54:04.889-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]"
time=2024-03-03T01:54:04.895-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-03T01:54:04.907-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA1\AppData\Local\Temp\ollama2071667329\cuda_v11.3;C:\Users\rucaradio\AppData\Local\Programs\Ollama;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\ "
time=2024-03-03T01:54:05.435-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA1\AppData\Local\Temp\ollama2071667329\cuda_v11.3\ext_server.dll"
time=2024-03-03T01:54:05.435-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: Quadro M6000, compute capability 5.2, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio\.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name = mistralai
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 1989.53 MiB
llm_load_tensors: CUDA1 buffer size = 1858.02 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 5
CUDA error: unspecified launch failure
current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953
cudaDeviceSynchronize()
GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"
time=2024-03-03T01:55:31.714-05:00 level=INFO source=images.go:710 msg="total blobs: 5"
time=2024-03-03T01:55:31.720-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-03T01:55:31.722-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-03-03T01:55:31.722-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-03T01:55:31.881-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu cpu_avx2 cuda_v11.3 cpu_avx]"
[GIN] 2024/03/03 - 01:59:06 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/03 - 01:59:06 | 200 | 18.2741ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/03 - 01:59:06 | 200 | 549.2µs | 127.0.0.1 | POST "/api/show"
time=2024-03-03T01:59:07.250-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-03T01:59:07.250-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-03T01:59:07.275-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]"
time=2024-03-03T01:59:07.300-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-03T01:59:07.302-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-03T01:59:07.332-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-03T01:59:07.332-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-03T01:59:07.332-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-03T01:59:07.332-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-03T01:59:07.332-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA1\AppData\Local\Temp\ollama4153122201\cuda_v11.3;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\;C:\Users\rucaradio\AppData\Local\Programs\Ollama"
time=2024-03-03T01:59:07.843-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA~1\AppData\Local\Temp\ollama4153122201\cuda_v11.3\ext_server.dll"
time=2024-03-03T01:59:07.843-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: Quadro M6000, compute capability 5.2, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio\.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name = mistralai
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 1989.53 MiB
llm_load_tensors: CUDA1 buffer size = 1858.02 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 5
CUDA error: unspecified launch failure
current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953
cudaDeviceSynchronize()
GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61 Driver Version: 551.61 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro M6000 WDDM | 00000000:05:00.0 On | Off |
| 27% 51C P8 28W / 250W | 646MiB / 12288MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3060 WDDM | 00000000:0B:00.0 On | N/A |
| 0% 35C P8 8W / 170W | 118MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4436 C+G ...on\122.0.2365.59\msedgewebview2.exe N/A |
| 0 N/A N/A 6612 C+G ...on\122.0.2365.59\msedgewebview2.exe N/A |
| 0 N/A N/A 6792 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 8600 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 9404 C+G ...2txyewy\StartMenuExperienceHost.exe N/A |

(base) C:\newpdev\ollama>NVCC -V
NVCC: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:30:42_Pacific_Standard_Time_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

<!-- gh-comment-id:1975067772 --> @ruca-radio commented on GitHub (Mar 3, 2024):

Similar issue here, with a Ryzen 5700X, 32gb RAM, and dual GPUs:

time=2024-03-02T23:09:06.654-05:00 level=INFO source=images.go:710 msg="total blobs: 0"
time=2024-03-02T23:09:06.655-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-02T23:09:06.655-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-03-02T23:09:06.655-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-02T23:09:06.811-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cuda_v11.3 cpu_avx cpu_avx2 cpu]"
[GIN] 2024/03/02 - 23:09:23 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/02 - 23:09:23 | 404 | 528.3µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/02 - 23:09:24 | 200 | 492.9968ms | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/03/02 - 23:09:27 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/02 - 23:09:27 | 404 | 0s | 127.0.0.1 | POST "/api/show"
time=2024-03-02T23:09:28.622-05:00 level=INFO source=download.go:136 msg="downloading e8a35b5937a5 in 42 100 MB part(s)"
time=2024-03-02T23:10:36.482-05:00 level=INFO source=download.go:136 msg="downloading 43070e2d4e53 in 1 11 KB part(s)"
time=2024-03-02T23:10:38.340-05:00 level=INFO source=download.go:136 msg="downloading e6836092461f in 1 42 B part(s)"
time=2024-03-02T23:10:41.345-05:00 level=INFO source=download.go:136 msg="downloading ed11eda7790d in 1 30 B part(s)"
time=2024-03-02T23:10:43.244-05:00 level=INFO source=download.go:136 msg="downloading f9b1e3196ecf in 1 483 B part(s)"
[GIN] 2024/03/02 - 23:10:47 | 200 | 1m20s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/03/02 - 23:10:47 | 200 | 524.1µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/02 - 23:10:47 | 200 | 528.6µs | 127.0.0.1 | POST "/api/show"
time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-02T23:10:47.840-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\\Windows\\System32\\nvml.dll C:\\Windows\\System32\\nvml.dll C:\\Windows\\system32\\nvml.dll]"
time=2024-03-02T23:10:47.855-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-02T23:10:47.859-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\\Users\\RUCARA~1\\AppData\\Local\\Temp\\ollama991450673\\cuda_v11.3;C:\\Users\\rucaradio\\AppData\\Local\\Programs\\Ollama;C:\\Program Files\\NVIDIA\\CUDNN\\v9.0\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;;;;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\Git\\cmd;C:\\Users\\rucaradio\\AppData\\Roaming\\nvm;C:\\Program Files\\nodejs;C:\\Program Files\\WindowsPowerShell\\Scripts;C:\\ProgramData\\chocolatey\\bin;;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.3.1\\;C:\\Program Files\\Go\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files\\CMake\\bin;C:\\Program Files\\NVIDIA Corporation\\NVIDIA NvDLISR;C:\\Users\\rucaradio\\.cargo\\bin;C:\\Users\\rucaradio\\scoop\\shims;C:\\Users\\rucaradio\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\rucaradio\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\ "
time=2024-03-02T23:10:48.341-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\\Users\\RUCARA~1\\AppData\\Local\\Temp\\ollama991450673\\cuda_v11.3\\ext_server.dll"
time=2024-03-02T23:10:48.342-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: Quadro M6000, compute capability 5.2, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio\.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name = mistralai
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 1989.53 MiB
llm_load_tensors: CUDA1 buffer size = 1858.02 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 5
CUDA error: unspecified launch failure
current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953
cudaDeviceSynchronize()
GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"
time=2024-03-03T01:54:03.879-05:00 level=INFO source=images.go:710 msg="total blobs: 5"
time=2024-03-03T01:54:03.884-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-03T01:54:03.885-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-03-03T01:54:03.885-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-03T01:54:04.032-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11.3 cpu]"
[GIN] 2024/03/03 - 01:54:04 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/03 - 01:54:04 | 200 | 14.8896ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/03 - 01:54:04 | 200 | 505.4µs | 127.0.0.1 | POST "/api/show"
time=2024-03-03T01:54:04.874-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-03T01:54:04.874-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-03T01:54:04.889-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\\Windows\\System32\\nvml.dll C:\\Windows\\System32\\nvml.dll C:\\Windows\\system32\\nvml.dll]"
time=2024-03-03T01:54:04.895-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-03T01:54:04.907-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-03T01:54:04.942-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\\Users\\RUCARA~1\\AppData\\Local\\Temp\\ollama2071667329\\cuda_v11.3;C:\\Users\\rucaradio\\AppData\\Local\\Programs\\Ollama;C:\\Program Files\\NVIDIA\\CUDNN\\v9.0\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\libnvvp;;;;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\dotnet\\;C:\\Program Files\\Git\\cmd;C:\\Users\\rucaradio\\AppData\\Roaming\\nvm;C:\\Program Files\\nodejs;C:\\Program Files\\WindowsPowerShell\\Scripts;C:\\ProgramData\\chocolatey\\bin;;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2023.3.1\\;C:\\Program Files\\Go\\bin;C:\\Program Files\\PowerShell\\7\\;C:\\Program Files\\CMake\\bin;C:\\Program Files\\NVIDIA Corporation\\NVIDIA NvDLISR;C:\\Users\\rucaradio\\.cargo\\bin;C:\\Users\\rucaradio\\scoop\\shims;C:\\Users\\rucaradio\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\rucaradio\\AppData\\Local\\Programs\\Microsoft VS Code\\bin;C:\\ "
time=2024-03-03T01:54:05.435-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\\Users\\RUCARA~1\\AppData\\Local\\Temp\\ollama2071667329\\cuda_v11.3\\ext_server.dll"
time=2024-03-03T01:54:05.435-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: Quadro M6000, compute capability 5.2, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio\.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name = mistralai
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 1989.53 MiB
llm_load_tensors: CUDA1 buffer size = 1858.02 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 5
CUDA error: unspecified launch failure
current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953
cudaDeviceSynchronize()
GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"
time=2024-03-03T01:55:31.714-05:00 level=INFO source=images.go:710 msg="total blobs: 5"
time=2024-03-03T01:55:31.720-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-03T01:55:31.722-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-03-03T01:55:31.722-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name = mistralai llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.33 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: CUDA0 buffer size = 1989.53 MiB llm_load_tensors: CUDA1 buffer size = 1858.02 MiB .................................................................................................. llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 5 CUDA error: unspecified launch failure current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953 cudaDeviceSynchronize() GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error" +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 551.61 Driver Version: 551.61 CUDA Version: 12.4 | |-----------------------------------------+------------------------+----------------------+ | GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. 
| | | | MIG M. | |=========================================+========================+======================| | 0 Quadro M6000 WDDM | 00000000:05:00.0 On | Off | | 27% 51C P8 28W / 250W | 646MiB / 12288MiB | 1% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 1 NVIDIA GeForce RTX 3060 WDDM | 00000000:0B:00.0 On | N/A | | 0% 35C P8 8W / 170W | 118MiB / 12288MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 4436 C+G ...on\122.0.2365.59\msedgewebview2.exe N/A | | 0 N/A N/A 6612 C+G ...on\122.0.2365.59\msedgewebview2.exe N/A | | 0 N/A N/A 6792 C+G C:\Windows\explorer.exe N/A | | 0 N/A N/A 8600 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A | | 0 N/A N/A 9404 C+G ...2txyewy\StartMenuExperienceHost.exe N/A | (base) C:\newpdev\ollama>NVCC -V NVCC: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2023 NVIDIA Corporation Built on Wed_Nov_22_10:30:42_Pacific_Standard_Time_2023 Cuda compilation tools, release 12.3, V12.3.107 Build cuda_12.3.r12.3/compiler.33567101_0
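Editor's note: the "model size = 3.83 GiB (4.54 BPW)" line in the log above is just the parameter count times the average bits per weight. As a back-of-the-envelope sketch (not ollama's actual accounting; the helper name is mine), this is how those numbers relate:

```python
def quantized_model_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-file size: params * (bits/weight) / 8 bytes, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# Values from the log above: 7.24 B params at 4.54 BPW (Q4_0 mistral)
size = quantized_model_size_gib(7.24e9, 4.54)
```

This lands on roughly 3.83 GiB, matching the log; add the KV cache and compute buffers on top of it when judging whether a model fits in a given amount of VRAM.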

@nanshaws commented on GitHub (Apr 1, 2024):

time=2024-04-01T08:12:19.872+08:00 level=INFO source=gpu.go:115 msg="Detecting GPU type"
time=2024-04-01T08:12:19.872+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library cudart64_*.dll"
time=2024-04-01T08:12:19.881+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [C:\Users\Administrator\AppData\Local\Programs\Ollama\cudart64_110.dll c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudart64_110.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudart64_110.dll]"
time=2024-04-01T08:12:19.940+08:00 level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart"
time=2024-04-01T08:12:19.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-01T08:12:20.082+08:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
time=2024-04-01T08:12:20.082+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-01T08:12:20.083+08:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
time=2024-04-01T08:12:20.083+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-01T08:12:20.083+08:00 level=INFO source=assets.go:108 msg="Updating PATH to C:\Users\ADMINI1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\libnvvp;C:\Program Files (x86)\jdk/bin;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\bin;D:\WindowsVSC\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64\;C:\Program Files\PlasticSCM5\server;C:\Program Files\PlasticSCM5\client;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;D:\work\apache-tomcat-9.0.1-windows-x64\apache-tomcat-9.0.1\bin\;D:\work\apache-maven-3.8.8-bin\apache-maven-3.8.8\bin\;D:\work\gradle-8.2.1-all\gradle-8.2.1\bin;D:\work\apache-jmeter-5.5\bin;D:\work\w64devkit-1.19.0\w64devkit\bin;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\MySQL\MySQL Server 8.0\bin;D:\Git\cmd;D:\python\;D:\nvm;C:\Program Files\nodejs;D:\work\visualvm_216\bin;D:\HashiCorp\Vagrant\bin;D:\weixin\微信web开发者工具\dll;D:\work\netcat-win32-1.12;D:\work\VMware-ovftool-4.5.0-20459872-win.x86_64\ovftool;D:\work\lu;D:\work\kotlin-compiler-1.9.22\kotlinc\bin;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2020.3.0\;D:\miniconda3;D:\miniconda3\Library\mingw-w64\bin;D:\miniconda3\Library\usr\bin;D:\miniconda3\Library\bin;D:\miniconda3\Scripts;C:\Program Files\MySQL\MySQL Shell 8.0\bin\;C:\Users\Administrator\AppData\Local\Microsoft\WindowsApps;C:\Users\Administrator\AppData\Roaming\npm;D:\nvm;C:\Program Files\nodejs;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\bin\;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\jre\bin\;C:\Users\Administrator\AppData\Local\GitHubDesktop\bin;C:\Users\Administrator\.dotnet\tools;D:\work\mongosh\;;C:\Users\Administrator\AppData\Local\Programs\Ollama"
loading library C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3\ext_server.dll
time=2024-04-01T08:12:20.099+08:00 level=INFO source=dyn_ext_server.go:87 msg="Loading Dynamic llm server: C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3\ext_server.dll"
time=2024-04-01T08:12:20.100+08:00 level=INFO source=dyn_ext_server.go:147 msg="Initializing llama server"
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from D:\ollama\blobs\sha256-456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma
llama_model_loader: - kv 1: general.name str = gemma-7b-it
llama_model_loader: - kv 2: gemma.context_length u32 = 8192
llama_model_loader: - kv 3: gemma.embedding_length u32 = 3072
llama_model_loader: - kv 4: gemma.block_count u32 = 28
llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 24576
llama_model_loader: - kv 6: gemma.attention.head_count u32 = 16
llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256
llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: general.file_type u32 = 2
llama_model_loader: - type f32: 57 tensors
llama_model_loader: - type q4_0: 196 tensors
llama_model_loader: - type q8_0: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_rot = 192
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 24576
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.54 B
llm_load_print_meta: model size = 4.84 GiB (4.87 BPW)
llm_load_print_meta: general.name = gemma-7b-it
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 227 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.19 MiB
llm_load_tensors: offloading 11 repeating layers to GPU
llm_load_tensors: offloaded 11/29 layers to GPU
llm_load_tensors: CPU buffer size = 4955.54 MiB
llm_load_tensors: CUDA0 buffer size = 1633.76 MiB
...........................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 544.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 352.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 506.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1302.88 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 20.00 MiB
llama_new_context_with_model: graph nodes = 957
llama_new_context_with_model: graph splits = 191
CUDA error: CUBLAS_STATUS_ALLOC_FAILED
current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:659
cublasCreate_v2(&cublas_handles[device])
GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:193: !"CUDA error"
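Editor's note: the "KV self size = 896.00 MiB" line above follows directly from the printed hyperparameters. A minimal sketch of that arithmetic (not llama.cpp's allocator, and the function name is mine; it assumes an f16 cache at 2 bytes per element):

```python
def kv_cache_mib(n_ctx: int, n_layer: int, n_embd_kv_gqa: int,
                 bytes_per_elem: int = 2) -> float:
    """Total K+V cache size in MiB for an f16 KV cache."""
    # K and V each hold n_ctx * n_layer * n_embd_kv_gqa elements
    per_cache = n_ctx * n_layer * n_embd_kv_gqa * bytes_per_elem
    return 2 * per_cache / 2**20

# gemma-7b log above: n_ctx=2048, n_layer=28, n_embd_k_gqa=4096 -> 896 MiB
gemma_kv = kv_cache_mib(2048, 28, 4096)
# mistral log in the earlier comment: n_ctx=2048, n_layer=32, n_embd_k_gqa=1024 -> 256 MiB
mistral_kv = kv_cache_mib(2048, 32, 1024)
```

Gemma's 16 full-width KV heads make its cache 3.5x larger than Mistral's GQA cache at the same context length, which, together with the 1302.88 MiB compute buffer, is why a 4 GB laptop GPU runs out of memory here even with only 11/29 layers offloaded.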


@dhiltgen commented on GitHub (Jun 1, 2024):

I would suggest giving the latest release a try to see if that improves the situation. That said, these failures may ultimately be due to #4599, which I'm still working on.


@dhiltgen commented on GitHub (Jun 22, 2024):

Please upgrade to the latest version (0.1.45) and this should be resolved now for CUDA cards.

Reference: github-starred/ollama#63763