[GH-ISSUE #817] unexpected EOF error #46905

Closed
opened 2026-04-28 01:56:47 -05:00 by GiteaMirror · 8 comments

Originally created by @Duxon on GitHub (Oct 17, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/817

Originally assigned to: @BruceMacD on GitHub.

I encounter a bug where some models (e.g., mistral and zephyr) return an error after prompting a second time.
I am running ollama version 0.1.3. See this text log for an example:

~$ ollama run zephyr
>>> hi, test
Hi, I'm unable to perform tests or experiments. However, [...]

>>> test2
Error: error reading llm response: unexpected EOF

GiteaMirror added the bug label 2026-04-28 01:56:47 -05:00

@BruceMacD commented on GitHub (Oct 17, 2023):

Is there anything in your logs?

They will be either in ~/.ollama/logs/server.log or in the journal for the service (journalctl -u ollama.service).
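
For example, assuming the default install locations above, something like:

```
# server log file (macOS / non-systemd installs)
tail -n 100 ~/.ollama/logs/server.log

# systemd journal for the ollama service (Linux installs via install.sh)
journalctl -u ollama.service -n 100 --no-pager
```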


@Duxon commented on GitHub (Oct 17, 2023):

Here are the last lines from journalctl:

okt 17 20:08:28 nanobuntu-jacob ollama[107606]: llm_load_tensors: ggml ctx size =    0.09 MB
okt 17 20:08:28 nanobuntu-jacob ollama[107606]: llm_load_tensors: using CUDA for GPU acceleration
okt 17 20:08:28 nanobuntu-jacob ollama[107606]: llm_load_tensors: mem required  =  758.12 MB (+  256.00 MB per state)
okt 17 20:08:28 nanobuntu-jacob ollama[107606]: llm_load_tensors: offloading 27 repeating layers to GPU
okt 17 20:08:28 nanobuntu-jacob ollama[107606]: llm_load_tensors: offloaded 27/35 layers to GPU
okt 17 20:08:28 nanobuntu-jacob ollama[107606]: llm_load_tensors: VRAM used: 3160 MB
okt 17 20:08:29 nanobuntu-jacob ollama[107606]: ...................................................................................................
okt 17 20:08:29 nanobuntu-jacob ollama[107606]: llama_new_context_with_model: kv self size  =  256.00 MB
okt 17 20:08:29 nanobuntu-jacob ollama[107606]: llama_new_context_with_model: compute buffer total size =  153.47 MB
okt 17 20:08:29 nanobuntu-jacob ollama[107606]: llama_new_context_with_model: VRAM scratch buffer: 152.00 MB
okt 17 20:08:30 nanobuntu-jacob ollama[117371]: llama server listening at http://127.0.0.1:65430
okt 17 20:08:30 nanobuntu-jacob ollama[117371]: {"timestamp":1697566110,"level":"INFO","function":"main","line":1602,"message":"HTTP server listening","hostname":"127.0.0.1","port":65430}
okt 17 20:08:30 nanobuntu-jacob ollama[117371]: {"timestamp":1697566110,"level":"INFO","function":"log_server_request","line":1204,"message":"request","remote_addr":"127.0.0.1","remote_port":47988,"status":200,>
okt 17 20:08:30 nanobuntu-jacob ollama[107606]: 2023/10/17 20:08:30 llama.go:422: llama runner started in 2.401782 seconds
okt 17 20:08:30 nanobuntu-jacob ollama[117371]: {"timestamp":1697566110,"level":"INFO","function":"log_server_request","line":1204,"message":"request","remote_addr":"127.0.0.1","remote_port":47988,"status":200,>
okt 17 20:08:30 nanobuntu-jacob ollama[117371]: {"timestamp":1697566110,"level":"INFO","function":"log_server_request","line":1204,"message":"request","remote_addr":"127.0.0.1","remote_port":47988,"status":200,>
okt 17 20:08:30 nanobuntu-jacob ollama[107606]: [GIN] 2023/10/17 - 20:08:30 | 200 |  2.730968875s |       127.0.0.1 | POST     "/api/generate"
okt 17 20:08:30 nanobuntu-jacob ollama[117371]: {"timestamp":1697566110,"level":"INFO","function":"log_server_request","line":1204,"message":"request","remote_addr":"127.0.0.1","remote_port":47988,"status":200,>
okt 17 20:08:32 nanobuntu-jacob ollama[117371]: {"timestamp":1697566112,"level":"INFO","function":"log_server_request","line":1204,"message":"request","remote_addr":"127.0.0.1","remote_port":47988,"status":200,>
okt 17 20:08:32 nanobuntu-jacob ollama[107606]: llama_print_timings:        load time =  1899.60 ms
okt 17 20:08:32 nanobuntu-jacob ollama[107606]: llama_print_timings:      sample time =     1.49 ms /     3 runs   (    0.50 ms per token,  2016.13 tokens per second)
okt 17 20:08:32 nanobuntu-jacob ollama[107606]: llama_print_timings: prompt eval time =   999.79 ms /    31 tokens (   32.25 ms per token,    31.01 tokens per second)
okt 17 20:08:32 nanobuntu-jacob ollama[107606]: llama_print_timings:        eval time =   155.18 ms /     2 runs   (   77.59 ms per token,    12.89 tokens per second)
okt 17 20:08:32 nanobuntu-jacob ollama[107606]: llama_print_timings:       total time =  1158.19 ms
okt 17 20:08:32 nanobuntu-jacob ollama[117371]: {"timestamp":1697566112,"level":"INFO","function":"log_server_request","line":1204,"message":"request","remote_addr":"127.0.0.1","remote_port":48004,"status":200,>
okt 17 20:08:32 nanobuntu-jacob ollama[107606]: [GIN] 2023/10/17 - 20:08:32 | 200 |  1.160832025s |       127.0.0.1 | POST     "/api/generate"
okt 17 20:08:32 nanobuntu-jacob ollama[117371]: {"timestamp":1697566112,"level":"INFO","function":"log_server_request","line":1204,"message":"request","remote_addr":"127.0.0.1","remote_port":48004,"status":200,>
okt 17 20:08:32 nanobuntu-jacob ollama[117371]: {"timestamp":1697566112,"level":"INFO","function":"log_server_request","line":1204,"message":"request","remote_addr":"127.0.0.1","remote_port":48004,"status":200,>
okt 17 20:08:33 nanobuntu-jacob ollama[107606]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:5487: out of memory
okt 17 20:08:33 nanobuntu-jacob ollama[107606]: current device: 0
okt 17 20:08:33 nanobuntu-jacob ollama[107606]: [GIN] 2023/10/17 - 20:08:33 | 200 |  908.002038ms |       127.0.0.1 | POST     "/api/generate"

It worked fine until I upgraded to v0.1.3.
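
For what it's worth, the tail of that log shows the failure itself: CUDA error 2 (out of memory) on device 0 during the second /api/generate call. One way to check whether VRAM headroom is the culprit is to watch GPU memory while reproducing (a sketch, assuming nvidia-smi from the NVIDIA driver is available):

```
# one-shot report of total/used/free VRAM on the NVIDIA GPU
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

# or keep it refreshing while sending the second prompt
watch -n 1 nvidia-smi
```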


@AlexandrePoisson commented on GitHub (Oct 17, 2023):

Hi,

I am facing quite a similar issue.
I upgraded this morning using curl https://ollama.ai/install.sh | sh

I now get an "unexpected EOF" error when running a model that was working fine two weeks ago. Here are the last lines from journalctl:
Oct 17 19:25:24 _ ollama[536]: [GIN] 2023/10/17 - 19:25:24 | 200 | 14.291µs | 127.0.0.1 | HEAD "/"
Oct 17 19:25:24 _ ollama[536]: [GIN] 2023/10/17 - 19:25:24 | 200 | 263.539µs | 127.0.0.1 | GET "/api/tags"
Oct 17 19:26:12 _ ollama[536]: [GIN] 2023/10/17 - 19:26:12 | 200 | 14.455µs | 127.0.0.1 | HEAD "/"
Oct 17 19:26:12 _ ollama[536]: [GIN] 2023/10/17 - 19:26:12 | 200 | 271.892µs | 127.0.0.1 | GET "/api/tags"
Oct 17 19:26:12 _ ollama[536]: 2023/10/17 19:26:12 llama.go:252: 3231 MiB VRAM available, loading up to 16 GPU layers
Oct 17 19:26:12 _ ollama[536]: 2023/10/17 19:26:12 llama.go:356: starting llama runner
Oct 17 19:26:12 _ ollama[536]: 2023/10/17 19:26:12 llama.go:408: waiting for llama runner to start responding
Oct 17 19:26:12 _ ollama[536]: ggml_init_cublas: found 1 CUDA devices:
Oct 17 19:26:12 _ ollama[536]:   Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5
Oct 17 19:26:12 _ ollama[47389]: {"timestamp":1697563572,"level":"INFO","function":"main","line":1190,"message":"build info","build":1009,"commit":"9e232f0"}
Oct 17 19:26:12 _ ollama[47389]: {"timestamp":1697563572,"level":"INFO","function":"main","line":1192,"message":"system info","n_threads":6,"total_threads":12,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | "}
Oct 17 19:26:12 _ ollama[536]: llama.cpp: loading model from /usr/share/ollama/.ollama/models/blobs/sha256:f79142715bc9539a2edbb4b253548db8b34fac22736593eeaa28555874476e30
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: format = ggjt v3 (latest)
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_vocab = 32000
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_ctx = 2048
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_embd = 5120
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_mult = 256
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_head = 40
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_head_kv = 40
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_layer = 40
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_rot = 128
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_gqa = 1
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: rnorm_eps = 5.0e-06
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: n_ff = 13824
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: freq_base = 10000.0
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: freq_scale = 1
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: ftype = 2 (mostly Q4_0)
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: model size = 13B
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: ggml ctx size = 0.11 MB
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: using CUDA for GPU acceleration
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: mem required = 4754.60 MB (+ 1600.00 MB per state)
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: allocating batch_size x (640 kB + n_ctx x 160 B) = 480 MB VRAM for the scratch buffer
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: offloading 16 repeating layers to GPU
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: offloaded 16/43 layers to GPU
Oct 17 19:26:12 _ ollama[536]: llama_model_load_internal: total VRAM used: 3204 MB
Oct 17 19:26:14 _ ollama[536]: WARNING: failed to allocate 1602.00 MB of pinned memory: out of memory
Oct 17 19:26:14 _ ollama[536]: llama_new_context_with_model: kv self size = 1600.00 MB
Oct 17 19:26:14 _ ollama[47389]: llama server listening at http://127.0.0.1:60201
Oct 17 19:26:14 _ ollama[47389]: {"timestamp":1697563574,"level":"INFO","function":"main","line":1443,"message":"HTTP server listening","hostname":"127.0.0.1","port":60201}
Oct 17 19:26:14 _ ollama[47389]: {"timestamp":1697563574,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":43498,"status":200,"method":"HEAD","path":"/","params":{}}
Oct 17 19:26:14 _ ollama[536]: 2023/10/17 19:26:14 llama.go:422: llama runner started in 2.401466 seconds
Oct 17 19:26:14 _ ollama[47389]: {"timestamp":1697563574,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":43498,"status":200,"method":"POST","path":"/tokenize","params":{}}
Oct 17 19:26:14 _ ollama[47389]: {"timestamp":1697563574,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":43498,"status":200,"method":"POST","path":"/tokenize","params":{}}
Oct 17 19:26:14 _ ollama[536]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml/ggml-cuda.cu:4856: out of memory
Oct 17 19:26:15 _ ollama[536]: [GIN] 2023/10/17 - 19:26:15 | 200 | 3.078410928s | 127.0.0.1 | POST "/api/generate"
Oct 17 19:31:14 _ ollama[536]: 2023/10/17 19:31:14 llama.go:438: llama runner stopped with error: exit status 1

It looks like the same out-of-memory issue as Duxon's, on a model that was running fine on the same machine. I did not update any drivers.
I get this error on llama2:13b and llama2:7b.
Orca-mini runs fine on my machine.

When I use PARAMETER num_gpu 0, those models also work. This is my current workaround.
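
A minimal sketch of that workaround (the model tag and file names are just examples; num_gpu 0 forces CPU-only inference):

```
# write a Modelfile that forces CPU-only inference via num_gpu 0
cat > Modelfile <<'EOF'
FROM llama2:13b
PARAMETER num_gpu 0
EOF

# build and run the CPU-only variant (llama2-13b-cpu is just an example name)
ollama create llama2-13b-cpu -f Modelfile
ollama run llama2-13b-cpu
```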


@jmorganca commented on GitHub (Oct 26, 2023):

Hi there, thanks so much for creating an issue. Given this is an OOM error on Linux, I will merge this with #737.


@Manni1000 commented on GitHub (Mar 12, 2025):

I have the same issue and it's not OOM. The model uses just 12 GB, I have 24 GB, and I still get this error.
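
(As a hedged suggestion: recent Ollama releases include an ollama ps command that shows each loaded model's size and how it is split between CPU and GPU, which can help rule out a silent spill-over to system RAM.)

```
# list loaded models with their size and CPU/GPU split (recent Ollama releases)
ollama ps
```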


@ghost commented on GitHub (Mar 23, 2025):

I had the same problem; upgrading Ollama to 0.6.2 solved it. Thanks, Ollama guys.


@Kaki-In commented on GitHub (May 13, 2025):

I am getting this error specifically with gemma3:12b, but not with some other models. Moreover, Ollama is loading it directly on the CPU, ignoring the GPU for memory reasons, and I am still getting the EOF, so I'm absolutely sure it has nothing to do with the GPU.


@Manni1000 commented on GitHub (May 14, 2025):

I think I was also using the same model as you.

Reference: github-starred/ollama#46905