[GH-ISSUE #737] CUDA out of memory #46856

Closed
opened 2026-04-28 00:53:27 -05:00 by GiteaMirror · 7 comments

Originally created by @konstantin1722 on GitHub (Oct 8, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/737

Hi, I installed Ollama via the install script on Arch Linux and downloaded the llama2 7b-chat model. When I run it with a prompt I get `Error: error reading llm response: unexpected EOF`, and overall I can't get it to work. Almost any command triggers this error.

I'm not sure, but perhaps #668 and #618 are related?

CUDA gets picked up automatically, and I would like to know:

  1. How do I fix this problem and what is causing it?
  2. How can I run the model using the CPU only?

Characteristics of my computer:

CPU: Intel i5-8600K (6) @ 4.300GHz
GPU: NVIDIA GeForce GTX 1060 6GB
Memory: 64118MiB

Any help would be appreciated...

Log ollama:

oct 08 12:15:07 desktop-pc ollama[8633]: [GIN] 2023/10/08 - 12:15:07 | 200 |      13.153µs |       127.0.0.1 | HEAD     "/"
oct 08 12:15:07 desktop-pc ollama[8633]: [GIN] 2023/10/08 - 12:15:07 | 200 |     136.828µs |       127.0.0.1 | GET      "/api/tags"
oct 08 12:15:07 desktop-pc ollama[8633]: 2023/10/08 12:15:07 routes.go:76: loaded llm process not responding, closing now
oct 08 12:15:07 desktop-pc ollama[8633]: 2023/10/08 12:15:07 llama.go:239: 6144 MiB VRAM available, loading up to 54 GPU layers
oct 08 12:15:07 desktop-pc ollama[8633]: 2023/10/08 12:15:07 llama.go:313: starting llama runner
oct 08 12:15:07 desktop-pc ollama[8633]: 2023/10/08 12:15:07 llama.go:349: waiting for llama runner to start responding
oct 08 12:15:07 desktop-pc ollama[12487]: ggml_init_cublas: found 1 CUDA devices:
oct 08 12:15:07 desktop-pc ollama[12487]:   Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1
oct 08 12:15:07 desktop-pc ollama[12487]: {"timestamp":1696756507,"level":"INFO","function":"main","line":1190,"message":"build info","build":1009,"commit":"9e232f0"}
oct 08 12:15:07 desktop-pc ollama[12487]: {"timestamp":1696756507,"level":"INFO","function":"main","line":1192,"message":"system info","n_threads":3,"total_threads":6,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | "}
oct 08 12:15:07 desktop-pc ollama[12487]: llama.cpp: loading model from /usr/share/ollama/.ollama/models/blobs/sha256:b5749cc827d33b7cb4c8869cede7b296a0a28d9e5d1982705c2ba4c603258159
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: format     = ggjt v3 (latest)
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_vocab    = 32000
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_ctx      = 2048
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_embd     = 4096
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_mult     = 256
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_head     = 32
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_head_kv  = 32
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_layer    = 32
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_rot      = 128
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_gqa      = 1
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: rnorm_eps  = 5.0e-06
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: n_ff       = 11008
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: freq_base  = 10000.0
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: freq_scale = 1
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: ftype      = 2 (mostly Q4_0)
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: model size = 7B
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: ggml ctx size =    0.08 MB
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: using CUDA for GPU acceleration
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: mem required  =  468.40 MB (+ 1024.00 MB per state)
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: offloading 32 repeating layers to GPU
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: offloading non-repeating layers to GPU
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: offloading v cache to GPU
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: offloading k cache to GPU
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: offloaded 35/35 layers to GPU
oct 08 12:15:07 desktop-pc ollama[12487]: llama_model_load_internal: total VRAM used: 4954 MB
oct 08 12:15:08 desktop-pc ollama[12487]: llama_new_context_with_model: kv self size  = 1024.00 MB
oct 08 12:15:08 desktop-pc ollama[12487]: llama server listening at http://127.0.0.1:60658
oct 08 12:15:08 desktop-pc ollama[12487]: {"timestamp":1696756508,"level":"INFO","function":"main","line":1443,"message":"HTTP server listening","hostname":"127.0.0.1","port":60658}
oct 08 12:15:08 desktop-pc ollama[12487]: {"timestamp":1696756508,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":56544,"status":200,"method":"HEAD","path":"/","params":{}}
oct 08 12:15:08 desktop-pc ollama[8633]: 2023/10/08 12:15:08 llama.go:365: llama runner started in 1.002318 seconds
oct 08 12:15:08 desktop-pc ollama[12487]: {"timestamp":1696756508,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":56544,"status":200,"method":"POST","path":"/tokenize","params":{}}
oct 08 12:15:08 desktop-pc ollama[12487]: {"timestamp":1696756508,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":56544,"status":200,"method":"POST","path":"/tokenize","params":{}}
oct 08 12:15:08 desktop-pc ollama[8633]: [GIN] 2023/10/08 - 12:15:08 | 200 |  1.098666919s |       127.0.0.1 | POST     "/api/generate"
oct 08 12:15:21 desktop-pc ollama[12487]: {"timestamp":1696756521,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":35924,"status":200,"method":"HEAD","path":"/","params":{}}
oct 08 12:15:21 desktop-pc ollama[12487]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml/ggml-cuda.cu:6290: out of memory
oct 08 12:15:21 desktop-pc ollama[8633]: [GIN] 2023/10/08 - 12:15:21 | 200 |  164.297406ms |       127.0.0.1 | POST     "/api/generate"
oct 08 12:15:21 desktop-pc ollama[8633]: 2023/10/08 12:15:21 llama.go:323: llama runner exited with error: exit status 1
GiteaMirror added the bug label 2026-04-28 00:53:27 -05:00

@jmorganca commented on GitHub (Oct 8, 2023):

Hi, sorry about this, we are looking into it now. Keep an eye on https://github.com/jmorganca/ollama/pull/724, which should fix this.


@konstantin1722 commented on GitHub (Oct 8, 2023):

> Hi sorry about this, we are looking into it now. Keep an eye on #724 which should fix this

All right, thank you. I'll wait for this problem to be resolved.


@konstantin1722 commented on GitHub (Oct 9, 2023):

> Hi sorry about this, we are looking into it now. Keep an eye on #724 which should fix this

  1. I've looked through the Modelfile guide and didn't find a way to explicitly disable GPU usage, or I just didn't understand which parameter is responsible for it. Is it possible?
  2. Also, I noticed that the model is not loaded into RAM. Is there any way to specify that it should be loaded into memory fully or partially?

@jmorganca commented on GitHub (Oct 11, 2023):

Hi @konstantin1722, you can use `PARAMETER num_gpu <number of layers>` to determine how many layers get offloaded. The specified number of layers will be loaded onto the GPU and the rest into RAM. Note that performance will slow down as more of the model is kept in RAM and processed by the CPU.
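
For reference, a minimal sketch of what that could look like in a Modelfile (the model tag, layer count, and created model name below are illustrative assumptions, not taken from this thread):

```
# Modelfile — illustrative sketch of the num_gpu suggestion above
# (llama2:7b-chat and the layer count 20 are assumptions, not from this thread)
FROM llama2:7b-chat
# offload only 20 layers to the GPU; the remaining layers stay in system RAM
PARAMETER num_gpu 20
```

It could then be built and run with `ollama create llama2-lowvram -f Modelfile` followed by `ollama run llama2-lowvram` (the name `llama2-lowvram` is just a placeholder).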


@jmorganca commented on GitHub (Oct 11, 2023):

Also, this should be fixed in the next release by #724 so I'm going to close it. But please do feel free to re-open it if it's still not working after the next release.


@venturaEffect commented on GitHub (Jan 21, 2024):

Still not working. I upgraded Ollama on my WSL. It was working before; I needed to update because of some changes on the LangChain side as well, and now no matter which model I try I always get: `Error: Post "http://127.0.0.1:11434/api/generate": EOF`

I tried creating a Modelfile with the name of the LLM model (dolphin-mistral) and added `FROM dolphin-2.1-mistral-7b PARAMETER num_gpu 0`, but nothing. It still doesn't work. I've searched everywhere; people are complaining about this, but the suggested solutions don't seem to fix this bug.

Appreciate any help.
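
For anyone retrying this workaround, here is a hedged sketch of the CPU-only Modelfile described above, assuming each directive goes on its own line and that `dolphin-mistral` (the model name from the comment) is already pulled locally:

```
# Modelfile — hypothetical CPU-only variant based on the comment above
# (the tag dolphin-mistral is an assumption; adjust it to a model you have pulled)
FROM dolphin-mistral
# request that zero layers be offloaded to the GPU, i.e. run fully on the CPU
PARAMETER num_gpu 0
```

Created with something like `ollama create dolphin-cpu -f Modelfile` and run with `ollama run dolphin-cpu` (the name `dolphin-cpu` is a placeholder). Whether this avoids the EOF error depends on what is actually crashing the runner, so treat it as a diagnostic step rather than a confirmed fix.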


@fabianslife commented on GitHub (Jan 25, 2024):

Same error as @venturaEffect. I am using LlamaIndex with Mixtral 8x7B on Ubuntu 20.04.

Sometimes it returns responses; however, most of the time I get the EOF error.

Reference: github-starred/ollama#46856