[GH-ISSUE #2952] Windows CUDA OOM running llama2 on dual RTX 2070 #48325

Closed
opened 2026-04-28 07:44:17 -05:00 by GiteaMirror · 5 comments

Originally created by @iamtechysandy on GitHub (Mar 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2952

Originally assigned to: @dhiltgen on GitHub.

C:\Users\admin>ollama run llama2
Error: Post "http://127.0.0.1:11434/api/chat": read tcp 127.0.0.1:52764->127.0.0.1:11434: wsarecv: An existing connection was forcibly closed by the remote host.
I'm getting this error while running llama2.


@dhiltgen commented on GitHub (Mar 6, 2024):

Can you share the server log so we can see why it crashed?
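
(For readers hitting the same error: on Windows the Ollama server log is typically written under %LOCALAPPDATA%\Ollama as server.log. A sketch of pulling it up from Command Prompt, assuming a default install location:

C:\Users\admin>type %LOCALAPPDATA%\Ollama\server.log

If the file isn't there, opening the folder with explorer %LOCALAPPDATA%\Ollama should show where the logs landed.)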


@iamtechysandy commented on GitHub (Mar 7, 2024):

time=2024-03-07T10:12:21.465+03:00 level=INFO source=images.go:710 msg="total blobs: 0"
time=2024-03-07T10:12:21.466+03:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-07T10:12:21.467+03:00 level=INFO source=routes.go:1021 msg="Listening on 127.0.0.1:11434 (version 0.1.28)"
time=2024-03-07T10:12:21.467+03:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-07T10:12:21.648+03:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cuda_v11.3 cpu cpu_avx2 cpu_avx]"
[GIN] 2024/03/07 - 10:12:40 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/07 - 10:12:40 | 404 | 620.9µs | 127.0.0.1 | POST "/api/show"
time=2024-03-07T10:12:43.972+03:00 level=INFO source=download.go:136 msg="downloading 8934d96d3f08 in 39 100 MB part(s)"
time=2024-03-07T10:15:32.256+03:00 level=INFO source=download.go:136 msg="downloading 8c17c2ebb0ea in 1 7.0 KB part(s)"
time=2024-03-07T10:15:35.455+03:00 level=INFO source=download.go:136 msg="downloading 7c23fb36d801 in 1 4.8 KB part(s)"
time=2024-03-07T10:15:38.607+03:00 level=INFO source=download.go:136 msg="downloading 2e0493f67d0c in 1 59 B part(s)"
time=2024-03-07T10:15:41.784+03:00 level=INFO source=download.go:136 msg="downloading fa304d675061 in 1 91 B part(s)"
time=2024-03-07T10:15:44.986+03:00 level=INFO source=download.go:136 msg="downloading 42ba7f8a01dd in 1 557 B part(s)"
[GIN] 2024/03/07 - 10:15:55 | 200 | 3m14s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/03/07 - 10:15:55 | 200 | 543.6µs | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/07 - 10:15:55 | 200 | 1.0454ms | 127.0.0.1 | POST "/api/show"
time=2024-03-07T10:15:55.607+03:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-07T10:15:55.607+03:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-07T10:15:55.611+03:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]"
time=2024-03-07T10:15:55.616+03:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-07T10:15:55.616+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-07T10:15:55.643+03:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-07T10:15:55.643+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-07T10:15:55.652+03:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-07T10:15:55.652+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-07T10:15:55.652+03:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\admin\AppData\Local\Temp\ollama313388432\cuda_v11.3;C:\Users\admin\AppData\Local\Programs\Ollama;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Users\admin\AppData\Local\Programs\Python\Python312\Scripts\;C:\Users\admin\AppData\Local\Programs\Python\Python312\;C:\Users\admin\AppData\Local\Microsoft\WindowsApps;C:\Users\admin\AppData\Local\Programs\Microsoft VS Code\bin;C:\Users\admin\AppData\Local\Programs\Ollama "
time=2024-03-07T10:16:20.162+03:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\admin\AppData\Local\Temp\ollama313388432\cuda_v11.3\ext_server.dll"
time=2024-03-07T10:16:20.162+03:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: GeForce RTX 2070, compute capability 7.5, VMM: yes
Device 1: GeForce RTX 2070, compute capability 7.5, VMM: yes
CUDA error: out of memory
current device: 0, in function ggml_init_cublas at C:\Users\jmorg\git\ollama\llm\llama.cpp\ggml-cuda.cu:8771
cudaStreamCreateWithFlags(&g_cudaStreams[id][is], 0x01)
GGML_ASSERT: C:\Users\jmorg\git\ollama\llm\llama.cpp\ggml-cuda.cu:256: !"CUDA error"
time=2024-03-07T10:19:13.818+03:00 level=INFO source=images.go:710 msg="total blobs: 6"
time=2024-03-07T10:19:13.819+03:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-07T10:19:13.819+03:00 level=INFO source=routes.go:1021 msg="Listening on 127.0.0.1:11434 (version 0.1.28)"
time=2024-03-07T10:19:13.820+03:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-07T10:19:13.980+03:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cpu cuda_v11.3]"
[GIN] 2024/03/07 - 10:19:14 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/07 - 10:19:14 | 200 | 2.1953ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/07 - 10:19:14 | 200 | 1.3772ms | 127.0.0.1 | POST "/api/show"
time=2024-03-07T10:19:14.834+03:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-07T10:19:14.834+03:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-07T10:19:14.837+03:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]"
time=2024-03-07T10:19:14.842+03:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-07T10:19:14.842+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-07T10:19:14.867+03:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-07T10:19:14.867+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-07T10:19:14.876+03:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-07T10:19:14.876+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-07T10:19:14.876+03:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\admin\AppData\Local\Temp\ollama911492293\cuda_v11.3;C:\Users\admin\AppData\Local\Programs\Ollama;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Users\admin\AppData\Local\Programs\Python\Python312\Scripts\;C:\Users\admin\AppData\Local\Programs\Python\Python312\;C:\Users\admin\AppData\Local\Microsoft\WindowsApps;C:\Users\admin\AppData\Local\Programs\Microsoft VS Code\bin;C:\Users\admin\AppData\Local\Programs\Ollama "
time=2024-03-07T10:19:15.234+03:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\admin\AppData\Local\Temp\ollama911492293\cuda_v11.3\ext_server.dll"
time=2024-03-07T10:19:15.234+03:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: GeForce RTX 2070, compute capability 7.5, VMM: yes
Device 1: GeForce RTX 2070, compute capability 7.5, VMM: yes
CUDA error: out of memory
current device: 0, in function ggml_init_cublas at C:\Users\jmorg\git\ollama\llm\llama.cpp\ggml-cuda.cu:8771
cudaStreamCreateWithFlags(&g_cudaStreams[id][is], 0x01)
GGML_ASSERT: C:\Users\jmorg\git\ollama\llm\llama.cpp\ggml-cuda.cu:256: !"CUDA error"
time=2024-03-07T10:19:27.035+03:00 level=INFO source=images.go:710 msg="total blobs: 6"
time=2024-03-07T10:19:27.037+03:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-07T10:19:27.037+03:00 level=INFO source=routes.go:1021 msg="Listening on 127.0.0.1:11434 (version 0.1.28)"
time=2024-03-07T10:19:27.037+03:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-07T10:19:27.328+03:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cuda_v11.3 cpu_avx2 cpu_avx cpu]"
[GIN] 2024/03/07 - 10:19:27 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/07 - 10:19:27 | 200 | 1.6972ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/03/07 - 10:19:27 | 200 | 1.043ms | 127.0.0.1 | POST "/api/show"
time=2024-03-07T10:19:28.046+03:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-07T10:19:28.046+03:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-07T10:19:28.050+03:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]"
time=2024-03-07T10:19:28.054+03:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-07T10:19:28.054+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-07T10:19:28.075+03:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-07T10:19:28.075+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-07T10:19:28.083+03:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-07T10:19:28.084+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-07T10:19:28.084+03:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\admin\AppData\Local\Temp\ollama2757144221\cuda_v11.3;C:\Users\admin\AppData\Local\Programs\Ollama;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Users\admin\AppData\Local\Programs\Python\Python312\Scripts\;C:\Users\admin\AppData\Local\Programs\Python\Python312\;C:\Users\admin\AppData\Local\Microsoft\WindowsApps;C:\Users\admin\AppData\Local\Programs\Microsoft VS Code\bin;C:\Users\admin\AppData\Local\Programs\Ollama "
time=2024-03-07T10:19:28.438+03:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\admin\AppData\Local\Temp\ollama2757144221\cuda_v11.3\ext_server.dll"
time=2024-03-07T10:19:28.438+03:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: GeForce RTX 2070, compute capability 7.5, VMM: yes
Device 1: GeForce RTX 2070, compute capability 7.5, VMM: yes
CUDA error: out of memory
current device: 0, in function ggml_init_cublas at C:\Users\jmorg\git\ollama\llm\llama.cpp\ggml-cuda.cu:8771
cudaStreamCreateWithFlags(&g_cudaStreams[id][is], 0x01)
GGML_ASSERT: C:\Users\jmorg\git\ollama\llm\llama.cpp\ggml-cuda.cu:256: !"CUDA error"


@iamtechysandy commented on GitHub (Mar 7, 2024):

My system configuration is an i9 9th Gen, 64 GB RAM, and NVIDIA RTX 2070.


@dhiltgen commented on GitHub (Mar 7, 2024):

Unfortunately it looks like our memory prediction algorithm didn't work correctly for this setup: we attempted to load too many layers onto the GPUs and ran out of VRAM. We're continuing to improve our calculations to avoid this.

In the next release (0.1.29) we'll be adding a new setting, `OLLAMA_MAX_VRAM=<bytes>`, that lets you cap the VRAM Ollama will try to use, as a workaround for this type of crash until we get the prediction logic fixed. For example, I believe your GPUs are 8G cards, so you could start with 15G and experiment until you find a setting that loads as many layers as possible without hitting the OOM crash: `OLLAMA_MAX_VRAM=16106127360`
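
(As an illustration of the workaround above, not from the original thread: once 0.1.29 is available, one way to try it on Windows is to quit Ollama from the tray, then set the variable in a fresh Command Prompt before starting the server. The exact value is something to experiment with; the behavior depends on how that release reads the variable.

C:\Users\admin>set OLLAMA_MAX_VRAM=16106127360
C:\Users\admin>ollama serve

Here 16106127360 bytes is 15 * 1024^3, i.e. 15 GiB across the two 8 GB cards; lower the value further if the OOM persists.)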


@jmorganca commented on GitHub (Mar 12, 2024):

Merging with #1952

Reference: github-starred/ollama#48325