[GH-ISSUE #2898] v0.1.28 RC: CUDA error: out of memory #1773

Closed
opened 2026-04-12 11:47:33 -05:00 by GiteaMirror · 5 comments

Originally created by @ovaisq on GitHub (Mar 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2898

Originally assigned to: @dhiltgen on GitHub.

Ollama v0.1.28 RC
Ryzen 7 1700 - 48GB RAM - 500GB SSD
GeForce GTX 1070ti 8GB VRAM - Driver v551.61
Windows 11 Pro

My Python code (running on a Debian 12 instance and making remote calls over the local network) loops through the deepseek-llm, llama2, and gemma models, doing this for each one:

    from ollama import AsyncClient

    # 'OLLAMA_API_URL' is a placeholder for the remote Ollama server address
    client = AsyncClient(host='OLLAMA_API_URL')
    response = await client.chat(
        model=llm,
        stream=False,
        messages=[
            {'role': 'user', 'content': content},
        ],
        options={'temperature': 0},
    )
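
A minimal sketch of the surrounding loop, with the model list, prompt, and server address as placeholders:

    import asyncio
    from ollama import AsyncClient

    OLLAMA_API_URL = 'http://ollama-host:11434'  # placeholder remote server address

    async def main() -> None:
        client = AsyncClient(host=OLLAMA_API_URL)
        content = 'example prompt text'  # placeholder prompt
        # Loop over the models named above, issuing one chat call per model.
        for llm in ('deepseek-llm', 'llama2', 'gemma'):
            response = await client.chat(
                model=llm,
                stream=False,
                messages=[{'role': 'user', 'content': content}],
                options={'temperature': 0},
            )
            print(llm, response['message']['content'][:80])

    asyncio.run(main())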

The Ollama server crashes at around the 10th iteration.

Ollama Crash error:
CUDA error: out of memory
current device: 0, in function ggml_cuda_pool_malloc_vmm at C:\Users\jmorg\git\ollama\llm\llama.cpp\ggml-cuda.cu:8587
cuMemAddressReserve(&g_cuda_pool_addr[device], CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
GGML_ASSERT: C:\Users\jmorg\git\ollama\llm\llama.cpp\ggml-cuda.cu:256: !"CUDA error"

Please let me know if any further information is needed.


@remy415 commented on GitHub (Mar 5, 2024):

A 1070 Ti 8GB card will likely struggle with 7B or larger models, especially when swapping between multiple models. Try grabbing models that are < 4GB in size and see if that resolves your issue.


@dhiltgen commented on GitHub (Mar 11, 2024):

Unfortunately, it looks like our memory prediction algorithm didn't work correctly for this setup: we attempted to load too many layers onto the GPU and it ran out of VRAM. We're continuing to improve our calculations to avoid this.

In the next release (0.1.29) we'll be adding a new setting, OLLAMA_MAX_VRAM=<bytes>, that lets you set a lower VRAM limit to work around this type of crash until we get the prediction logic fixed. For example, I believe your GPU is an 8GB card, so you could start with 7GB and experiment until you find a setting that loads as many layers as possible without hitting the OOM crash: OLLAMA_MAX_VRAM=7516192768
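
(For reference, 7516192768 is simply 7 GiB expressed in bytes. A minimal sketch of the conversion, in case you want to try other caps:)

    # Convert a VRAM cap expressed in GiB to the byte count OLLAMA_MAX_VRAM expects.
    GIB = 1024 ** 3

    def vram_cap_bytes(gib: float) -> int:
        return int(gib * GIB)

    # 7 GiB -> 7516192768, matching the value suggested above.
    print(f'OLLAMA_MAX_VRAM={vram_cap_bytes(7)}')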


@ovaisq commented on GitHub (Mar 11, 2024):

Hey Daniel, thanks for getting back to me. It's nice to reconnect after our time at VMware! I was surprised by the crashes too, considering the 1070ti has 8GB VRAM. Since it was affecting my POC project, I've switched the machine from Windows 11 to Debian 12, and version 0.1.28 has been running smoothly on the new setup without any crashes. Currently, I have a setup with three dedicated Ollama Servers: one M1 Max with 32GB RAM, one M1 Mac Mini with 16GB RAM, and one Ryzen 7 1700 with 48GB RAM, plus the 1070ti with 8GB VRAM. These servers are load balanced by Nginx, and it's giving me the performance I need for now. I plan to invest in more suitable hardware eventually.


@ovaisq commented on GitHub (Mar 11, 2024):

FWIW, I originally started out with an Ubuntu VM on ESXi 8u2 with the 1070 Ti set to passthrough (v0.1.24) and ran into lockups galore, which led me to the path I'm on now.


@jmorganca commented on GitHub (Mar 12, 2024):

Merging with #1952
