[GH-ISSUE #3405] API server stops responding - CUDA error: out of memory #2096

Closed
opened 2026-04-12 12:20:20 -05:00 by GiteaMirror · 4 comments

Originally created by @Zig1375 on GitHub (Mar 29, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3405

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

On a Windows 11 machine, when I send a request to the API (Command R model), the server stops responding.
From the command line, however, the same model works perfectly.
The log file shows these errors:

CUDA error: out of memory
  current device: 0, in function alloc at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:532
  cuMemSetAccess(pool_addr + pool_size, reserve_size, &access, 1)
GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:193: !"CUDA error"

As I said above, it works perfectly from the command line: I can chat with the model for a long time. The API, however, stops working on the first request. Restarting the machine and Ollama does not help.

What did you expect to see?

The API server should work the same as the command line.

Steps to reproduce

Install Command R on a Windows 11 machine (Nvidia 4070 with 12 GB, 32 GB RAM). Send an API request (with a prompt of about 1000 words).
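
A minimal sketch of such a request in Python, for anyone trying to reproduce this. It is not the exact request used in the original report. The address `http://localhost:11434`, the model tag `command-r`, and the filler prompt are assumptions; `num_ctx` is passed through `options` because a later comment in this thread ties the failure rate to it. The helper names `build_generate_request` and `reproduce` are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="command-r", num_ctx=2048):
    """Build (but do not send) a non-streaming POST for /api/generate."""
    payload = {
        "model": model,                   # assumed model tag for Command R
        "prompt": prompt,
        "stream": False,                  # one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window size
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reproduce(num_ctx=2048, timeout=300):
    """Send a prompt of roughly 1000 words and return the generated text."""
    prompt = "Summarize the following text: " + " ".join(["word"] * 1000)
    request = build_generate_request(prompt, num_ctx=num_ctx)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `reproduce()` while watching the server log should show whether the `CUDA error: out of memory` assertion appears on the first request.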

Are there any recent changes that introduced the issue?

No response

OS

Windows

Architecture

amd64

Platform

No response

Ollama version

0.1.30

GPU

Nvidia

GPU info

No response

CPU

AMD

Other software

No response

GiteaMirror added the bug, nvidia labels 2026-04-12 12:20:20 -05:00

@Zig1375 commented on GitHub (Apr 10, 2024):

I encounter the same issue from time to time when num_ctx is set to 2048.
If num_ctx is set to 4096 or higher, the error occurs consistently (using an Nvidia 4070 with 12 GB of memory and 64 GB of RAM).
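
One plausible reason the error tracks num_ctx, offered as a rough sketch rather than a diagnosis of this bug: in a standard transformer, the key/value cache grows linearly with the context length. The function name and the architecture numbers below are hypothetical placeholders; they are not taken from Command R or from this issue, and the real allocation also includes weights and scratch buffers that are not modeled here.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: one key and one value vector
    per layer, per KV head, per context position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Placeholder architecture numbers, NOT Command R's real configuration.
HYPOTHETICAL = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for n_ctx in (2048, 4096, 8192):
    gib = kv_cache_bytes(n_ctx, **HYPOTHETICAL) / 2**30
    print(f"num_ctx={n_ctx}: ~{gib:.2f} GiB for the KV cache")
```

Doubling num_ctx doubles this term, so a model that barely fits in 12 GB at 2048 could run out of GPU memory at 4096.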


@bilalshafim commented on GitHub (Apr 23, 2024):

I have been encountering the same issue. At first, using the API would start sending a stream but it would stop with error: unexpected server status: 1. Now, it does not send back a response at all.

I am trying to use Ollama with Flowise to connect to another service. I chose Ollama because it has native support in Flowise; I just need to add the server address. Is there an alternative to Ollama? I want to get the service up as soon as possible.


@dhiltgen commented on GitHub (Jun 22, 2024):

Please upgrade to 0.1.45 and let us know how it goes. Multiple fixes have gone in that should help the situation.


@dhiltgen commented on GitHub (Jul 3, 2024):

If you're still having out of memory problems on the latest release, please share an updated server log and I'll reopen the issue.


Reference: github-starred/ollama#2096