[GH-ISSUE #1374] Out of memory error on model that previously worked fine after update to version 0.1.13 #62762

Closed
opened 2026-05-03 10:14:15 -05:00 by GiteaMirror · 27 comments

Originally created by @madsamjp on GitHub (Dec 4, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1374

I configured a model to run entirely in VRAM using the following Modelfile:

FROM deepseek-coder:33b-instruct-q5_K_S

PARAMETER num_gpu 65
PARAMETER num_ctx 2048
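
For reference, a Modelfile like this is built and run with the standard Ollama CLI commands (a minimal sketch; the local name `deepseek-vram` is just an illustrative tag, not anything from the setup above):

```bash
# Register the Modelfile under a local name and run it
# ("deepseek-vram" is an arbitrary illustrative name).
ollama create deepseek-vram -f Modelfile
ollama run deepseek-vram "Write a quicksort in Python"
```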

This ran without issue, using about 22GB of my 4090's 24GB of VRAM. It generated responses very quickly, which was very helpful for getting quick answers to short coding queries.

However, yesterday I updated Ollama (to 0.1.13), and now I cannot run the same model. I get an out of memory error, despite the model not needing more than 22.5GB (according to the logs below).

I run Ollama on a headless Linux server, so there are no other applications using the GPU.

Was there a change in how much VRAM Ollama allocates, so that it now needs more than before? Is there a way to configure Ollama so that it behaves the same way as before?

EDIT: Reverting back to Ollama version 0.1.11 resolves the issue for now.

Error:

Dec 04 16:28:20 osm-server ollama[528776]: llm_load_tensors: offloaded 65/65 layers to GPU
Dec 04 16:28:20 osm-server ollama[528776]: llm_load_tensors: VRAM used: 21741.89 MiB
Dec 04 16:28:23 osm-server ollama[528776]: ....................................................................................................
Dec 04 16:28:23 osm-server ollama[528776]: llama_new_context_with_model: n_ctx      = 2048
Dec 04 16:28:23 osm-server ollama[528776]: llama_new_context_with_model: freq_base  = 100000.0
Dec 04 16:28:23 osm-server ollama[528776]: llama_new_context_with_model: freq_scale = 0.25
Dec 04 16:28:24 osm-server ollama[528776]: llama_kv_cache_init: offloading v cache to GPU
Dec 04 16:28:24 osm-server ollama[528776]: llama_kv_cache_init: offloading k cache to GPU
Dec 04 16:28:24 osm-server ollama[528776]: llama_kv_cache_init: VRAM kv self = 496.00 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: kv self size  =  496.00 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_build_graph: non-view tensors processed: 1430/1430
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: compute buffer total size = 273.07 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: VRAM scratch buffer: 270.00 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: total VRAM used: 22507.89 MiB (model: 21741.89 MiB, context: 766.00 MiB)
Dec 04 16:28:24 osm-server ollama[600735]: {"timestamp":1701707304,"level":"INFO","function":"main","line":2917,"message":"HTTP server listening","hostname":"127.0.0.1","port":57264}
Dec 04 16:28:24 osm-server ollama[600735]: {"timestamp":1701707304,"level":"INFO","function":"log_server_request","line":2478,"message":"request","remote_addr":"127.0.0.1","remote_port":46990,"status":200,"method":"HEAD","path":"/","params":{}}
Dec 04 16:28:24 osm-server ollama[528776]: 2023/12/04 16:28:24 llama.go:493: llama runner started in 4.401485 seconds
Dec 04 16:28:24 osm-server ollama[528776]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:5884: out of memory
Dec 04 16:28:24 osm-server ollama[528776]: current device: 0
Dec 04 16:28:25 osm-server ollama[528776]: 2023/12/04 16:28:25 llama.go:436: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:5884: out of memory
Dec 04 16:28:25 osm-server ollama[528776]: current device: 0
Dec 04 16:28:25 osm-server ollama[528776]: 2023/12/04 16:28:25 llama.go:510: llama runner stopped successfully
Dec 04 16:28:25 osm-server ollama[528776]: [GIN] 2023/12/04 - 16:28:25 | 200 |  6.468638351s |       127.0.0.1 | POST     "/api/generate"

@phalexo commented on GitHub (Dec 4, 2023):

Is there a download for the older version somewhere? I'd like to try it.


@technovangelist commented on GitHub (Dec 4, 2023):

How much ram does your machine have? You mentioned vram.


@phalexo commented on GitHub (Dec 4, 2023):

How much ram does your machine have? You mentioned vram.

I know you are asking the original poster, but I have 330GiB on the host and 12.2GiB per GPU (4 devices), and I am seeing something similar, even with a freshly rebuilt Ollama. Loading Mistral uses less than 50% of VRAM.

It runs ok on the host though.


@madsamjp commented on GitHub (Dec 4, 2023):

How much ram does your machine have? You mentioned vram.

Ample - 96GB. It's working now that I've reverted to version 0.1.11, which suggests to me that something in the latest update changed how Ollama allocates VRAM.


@madsamjp commented on GitHub (Dec 4, 2023):

Is there a download for the older version somewhere? I'd like to try it.

This is the version I've rolled back to - you can download from here: https://github.com/jmorganca/ollama/releases/tag/v0.1.11
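
For anyone else who wants to pin that release on Linux, something along these lines should work (a sketch, assuming the release page ships a standalone `ollama-linux-amd64` binary and the install script set up the usual `ollama` systemd service; adjust the destination to wherever your `ollama` binary actually lives):

```bash
# Stop the service, swap in the v0.1.11 binary, restart.
sudo systemctl stop ollama
sudo curl -L https://github.com/jmorganca/ollama/releases/download/v0.1.11/ollama-linux-amd64 \
  -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama
sudo systemctl start ollama
```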


@phalexo commented on GitHub (Dec 4, 2023):

Fantastic. Before dropping to 0.1.11 it was printing junk, and dying on the second inquiry. Now it seems to work. Quickly too.


@technovangelist commented on GitHub (Dec 5, 2023):

What OS are you running? How did you install it?


@madsamjp commented on GitHub (Dec 6, 2023):

What OS are you running? How did you install it?

Ubuntu 20.04
Using the install script:

curl https://ollama.ai/install.sh | sh

@technovangelist commented on GitHub (Dec 9, 2023):

Thanks for sharing this. We are looking into it. There is a release coming soon, 0.1.14, but I don't think a fix will be in there. Will let you know what we find. This is a bit strange.


@igorschlum commented on GitHub (Dec 10, 2023):

@madsamjp did you try with 0.1.14 that is out now?


@phalexo commented on GitHub (Dec 12, 2023):

@madsamjp did you try with 0.1.14 that is out now?

I have tried it with 0.1.14 as modified for Mixtral, and the error is back from the dead.

So, if I use version 0.1.11 I don't get the out of memory error, but I get another error specific to Mixtral. And if I use the modified version 0.1.14, the Mixtral error is gone, but I am back to the same cuBLAS fake OOM error.


@phalexo commented on GitHub (Dec 12, 2023):

@technovangelist Has anyone discovered anything new on this? Perhaps in other threads?

I have tried to use the Mixtral branch, derived from 0.1.14 (I assume), and the error is still there. The Mixtral Q6_K loads OK, but fails after I enter some text and start generation. It does the same with a model that is only 5GiB as well.


@madsamjp commented on GitHub (Dec 15, 2023):

@igorschlum @technovangelist I've now tried with version 0.1.16. The model loads into VRAM fine - takes 23688MiB, but dies when I give it a prompt. Seems only v0.1.13 (edit: v 0.1.11) works for this model.

Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: n_ctx      = 2048
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: freq_base  = 100000.0
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: freq_scale = 0.25
Dec 15 20:36:50 osm-server ollama[4101606]: llama_kv_cache_init: VRAM kv self = 496.00 MB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: KV self size  =  496.00 MiB, K (f16):  248.00 MiB, V (f16):  248.00 MiB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_build_graph: non-view tensors processed: 1306/1306
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: compute buffer total size = 273.32 MiB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: VRAM scratch buffer: 270.00 MiB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: total VRAM used: 22507.89 MiB (model: 21741.89 MiB, context: 766.00 MiB)

---

Dec 15 20:39:32 osm-server ollama[4129881]: {"timestamp":1702672772,"level":"INFO","function":"log_server_request","line":2596,"message":"request","remote_addr":"127.0.0.1","remote_port":41532,"status":200,"method":"HEAD","path":"/","params":{}}
Dec 15 20:39:32 osm-server ollama[4101606]: 2023/12/15 20:39:32 llama.go:577: loaded 0 images
Dec 15 20:39:32 osm-server ollama[4101606]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: out of memory
Dec 15 20:39:32 osm-server ollama[4101606]: current device: 0
Dec 15 20:39:32 osm-server ollama[4101606]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: !"CUDA error"
Dec 15 20:39:33 osm-server ollama[4101606]: 2023/12/15 20:39:33 llama.go:451: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: out of memory
Dec 15 20:39:33 osm-server ollama[4101606]: current device: 0
Dec 15 20:39:33 osm-server ollama[4101606]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: !"CUDA error"
Dec 15 20:39:33 osm-server ollama[4101606]: 2023/12/15 20:39:33 llama.go:525: llama runner stopped successfully


@phalexo commented on GitHub (Dec 15, 2023):

Did you mean to say 0.1.13 works? It is v0.1.11 that works for me. I copied the gguf folder from v0.1.11 to v0.1.12, recompiled, and it made v0.1.12 work. Between 11 and 12 something was broken. I was also able to get the same/very similar error with llama.cpp directly.


@igorschlum commented on GitHub (Dec 15, 2023):

@phalexo Ollama 0.1.15 is released. It's worth a try.


@phalexo commented on GitHub (Dec 15, 2023):

@madsamjp, I tried it unsuccessfully with the next version up, v0.1.16, so v0.1.15 cannot possibly work.


@madsamjp commented on GitHub (Dec 16, 2023):

Did you to say 0.1.13 works? It is the v0.1.11 that works for me. I copied gguf folder from v0.1.11 to v0.1.12, recompiled and it made v0.1.12 work. Between 11 and 12 something was broken. I was also able to get the same/very similar error with llama.cpp directly.


Apologies - I meant to say 0.1.11. I've tried 0.1.16 and I'm getting the OOM error, so I've had to revert back to 0.1.11.

Neither 0.1.13 nor 0.1.16 worked for me. Only 0.1.11 did.


@phalexo commented on GitHub (Dec 16, 2023):

git clone --recursive https://github.com/jmorganca/ollama.git
cd ollama/llm/llama.cpp
# Edit generate_linux.go so both CUDA cmake directives include -DLLAMA_CUDA_FORCE_MMQ=on:
vi generate_linux.go

//go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build ggml/build/cuda --target server --config Release
//go:generate mv ggml/build/cuda/bin/server ggml/build/cuda/bin/ollama-runner
//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build gguf/build/cuda --target server --config Release
//go:generate mv gguf/build/cuda/bin/server gguf/build/cuda/bin/ollama-runner

# Then rebuild from the repo root:
cd ../..
go generate ./...
go build .
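
If the build succeeds, the locally built binary can be used in place of the packaged one (a sketch; run the server from the repo root and talk to it from another shell):

```bash
# Start the freshly built server, then run the same model against it.
./ollama serve &
./ollama run deepseek-coder:33b-instruct-q5_K_S
```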

@madsamjp commented on GitHub (Dec 16, 2023):

@phalexo this works! It seems that adding the -DLLAMA_CUDA_FORCE_MMQ=on flag solves the issue for me.


@igorschlum commented on GitHub (Dec 16, 2023):

Hi @madsamjp

Great news. -DLLAMA_CUDA_FORCE_MMQ=on forces the usage of MMQ with the GPU even if GPU driver is not said to be compatible with CUDA; this parameter cannot be included in the Ollama Core and should stay as an optional parameter.

I think that you can close the Issue now. You could also try to update your GPU driver software and see if it's compatible with 0.1.15 version of Ollama.
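
If it helps, the installed driver version can be checked directly with nvidia-smi (a sketch):

```bash
# Report the GPU model and installed NVIDIA driver version.
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```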


@phalexo commented on GitHub (Dec 16, 2023):

How is the performance though? Is it impacted by the change?


@technovangelist commented on GitHub (Jan 3, 2024):

Can you try re-pulling the models being used? We updated most of them in the last few weeks to address issues like this.


@madsamjp commented on GitHub (Jan 3, 2024):

@technovangelist I've updated to the latest version of Ollama (0.1.17), and pulled the latest deepseek-coder:33b-instruct-q5_K_S model. Here is my modelfile:

FROM deepseek-coder:33b-instruct-q5_K_S

PARAMETER num_gpu 63
PARAMETER num_ctx 2048

I can load the model into VRAM just fine. It uses 23697MiB:
![image](https://github.com/jmorganca/ollama/assets/49611363/bb0ab4af-6b8a-47af-b889-50293b5c1c31)
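
The reported usage can be cross-checked against what the driver sees (a sketch):

```bash
# Show per-GPU memory usage as reported by the NVIDIA driver.
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
```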

The logs:

Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: ggml ctx size =    0.21 MiB
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: using CUDA for GPU acceleration
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: mem required  =  151.81 MiB
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: offloading 62 repeating layers to GPU
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: offloaded 63/63 layers to GPU
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: VRAM used: 21741.89 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: ....................................................................................................
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: n_ctx      = 2048
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: freq_base  = 100000.0
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: freq_scale = 0.25
Jan 03 18:06:36 osm-server ollama[2633395]: llama_kv_cache_init: VRAM kv self = 496.00 MB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: KV self size  =  496.00 MiB, K (f16):  248.00 MiB, V (f16):  248.00 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_build_graph: non-view tensors processed: 1306/1306
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: compute buffer total size = 273.19 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: VRAM scratch buffer: 270.00 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: total VRAM used: 22507.89 MiB (model: 21741.89 MiB, context: 766.00 MiB)

However, when I give it a prompt, it quickly dies:

Jan 03 18:06:36 osm-server ollama[2635863]: {"timestamp":1704305196,"level":"INFO","function":"main","line":3097,"message":"HTTP server listening","port":"58560","hostname":"127.0.0.1"}
Jan 03 18:06:36 osm-server ollama[2635863]: {"timestamp":1704305196,"level":"INFO","function":"log_server_request","line":2608,"message":"request","remote_addr":"127.0.0.1","remote_port":50516,"status":200,"method":"HEAD","path":"/","params":{}}
Jan 03 18:06:36 osm-server ollama[2633395]: 2024/01/03 18:06:36 llama.go:508: llama runner started in 4.000542 seconds
Jan 03 18:06:36 osm-server ollama[2633395]: [GIN] 2024/01/03 - 18:06:36 | 200 |  4.138764899s |       127.0.0.1 | POST     "/api/generate"
Jan 03 18:07:59 osm-server ollama[2633395]: 2024/01/03 18:07:59 llama.go:577: loaded 0 images
Jan 03 18:07:59 osm-server ollama[2635863]: {"timestamp":1704305279,"level":"INFO","function":"log_server_request","line":2608,"message":"request","remote_addr":"127.0.0.1","remote_port":54074,"status":200,"method":"HEAD","path":"/","params":{}}
Jan 03 18:07:59 osm-server ollama[2633395]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: out of memory
Jan 03 18:07:59 osm-server ollama[2633395]: current device: 0
Jan 03 18:07:59 osm-server ollama[2633395]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: !"CUDA error"
Jan 03 18:08:00 osm-server ollama[2633395]: 2024/01/03 18:08:00 llama.go:451: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: out of memory
Jan 03 18:08:00 osm-server ollama[2633395]: current device: 0
Jan 03 18:08:00 osm-server ollama[2633395]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: !"CUDA error"
Jan 03 18:08:00 osm-server ollama[2633395]: 2024/01/03 18:08:00 llama.go:525: llama runner stopped successfully

The only way I can continue to use this model is to build from source with the -DLLAMA_CUDA_FORCE_MMQ=on flag.


@phalexo commented on GitHub (Jan 3, 2024):

Did you ever test performance with the MMQ flag versus 0.1.11?


@madsamjp commented on GitHub (Jan 3, 2024):

@phalexo I haven't tested. Why, are you noticing degraded performance?


@phalexo commented on GitHub (Jan 3, 2024):

All my testing is ad hoc, so it's difficult to assess. I thought you might be running a largish system where it would be noticeable.

I have a suspicion that there may be a performance hit. If my understanding is correct, the flag shifts work away from cuBLAS to different kernels. If cuBLAS is better optimized, there may be a difference.


@madsamjp commented on GitHub (Jan 4, 2024):

@phalexo I'm just running a meager 4090! VRAM is a massive issue. I've found I can squeeze deepseek-coder 33b Q5_K_S into my VRAM if I reduce the context window to 2048, but it's right on the edge, using about 23.7GB. The model is really good AND fast at answering coding questions, which I find I'm relying on more and more these days for both my professional and personal work. If I have some time over the weekend I'll revert back to 0.1.11, test it, and report back here.
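
For a rough comparison between the MMQ build and 0.1.11, the CLI's verbose mode should print generation throughput (a sketch, assuming `--verbose` is available in both versions; the prompt is arbitrary):

```bash
# Prints timing stats, including the eval rate in tokens/s, after the response.
ollama run deepseek-coder:33b-instruct-q5_K_S --verbose \
  "Explain the difference between a mutex and a semaphore."
```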

Reference: github-starred/ollama#62762