[GH-ISSUE #1374] Out of memory error on model that previously worked fine after update to version 0.1.13 #62762

Closed
opened 2026-05-03 10:14:15 -05:00 by GiteaMirror · 27 comments

Originally created by @madsamjp on GitHub (Dec 4, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1374

I configured a model to run entirely in VRAM using the following Modelfile:

FROM deepseek-coder:33b-instruct-q5_K_S

PARAMETER num_gpu 65
PARAMETER num_ctx 2048
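
For reference, a Modelfile like this is built and run with the standard Ollama CLI commands (a minimal sketch; the local name `deepseek-vram` is just an illustrative tag, not anything from the setup above):

```bash
# Register the Modelfile under a local name and run it
# ("deepseek-vram" is an arbitrary illustrative name).
ollama create deepseek-vram -f Modelfile
ollama run deepseek-vram "Write a quicksort in Python"
```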

This ran without issue, using about 22GB of my 4090's 24GB of VRAM. It generated responses very quickly, which was very helpful for getting quick answers to short coding queries.

However, yesterday I updated Ollama (to 0.1.13), and now I cannot run the same model. I get an out of memory error, despite the model not needing more than 22.5GB (according to the logs below).

I run Ollama on a headless Linux server, so there are no other applications using the GPU.

Was there a change in how much VRAM Ollama allocates, so that it now needs more than before? Is there a way to configure Ollama so that it behaves the same way as before?

EDIT: Reverting back to Ollama version 0.1.11 resolves the issue for now.

Error:

Dec 04 16:28:20 osm-server ollama[528776]: llm_load_tensors: offloaded 65/65 layers to GPU
Dec 04 16:28:20 osm-server ollama[528776]: llm_load_tensors: VRAM used: 21741.89 MiB
Dec 04 16:28:23 osm-server ollama[528776]: ....................................................................................................
Dec 04 16:28:23 osm-server ollama[528776]: llama_new_context_with_model: n_ctx      = 2048
Dec 04 16:28:23 osm-server ollama[528776]: llama_new_context_with_model: freq_base  = 100000.0
Dec 04 16:28:23 osm-server ollama[528776]: llama_new_context_with_model: freq_scale = 0.25
Dec 04 16:28:24 osm-server ollama[528776]: llama_kv_cache_init: offloading v cache to GPU
Dec 04 16:28:24 osm-server ollama[528776]: llama_kv_cache_init: offloading k cache to GPU
Dec 04 16:28:24 osm-server ollama[528776]: llama_kv_cache_init: VRAM kv self = 496.00 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: kv self size  =  496.00 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_build_graph: non-view tensors processed: 1430/1430
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: compute buffer total size = 273.07 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: VRAM scratch buffer: 270.00 MiB
Dec 04 16:28:24 osm-server ollama[528776]: llama_new_context_with_model: total VRAM used: 22507.89 MiB (model: 21741.89 MiB, context: 766.00 MiB)
Dec 04 16:28:24 osm-server ollama[600735]: {"timestamp":1701707304,"level":"INFO","function":"main","line":2917,"message":"HTTP server listening","hostname":"127.0.0.1","port":57264}
Dec 04 16:28:24 osm-server ollama[600735]: {"timestamp":1701707304,"level":"INFO","function":"log_server_request","line":2478,"message":"request","remote_addr":"127.0.0.1","remote_port":46990,"status":200,"method":"HEAD","path":"/","params":{}}
Dec 04 16:28:24 osm-server ollama[528776]: 2023/12/04 16:28:24 llama.go:493: llama runner started in 4.401485 seconds
Dec 04 16:28:24 osm-server ollama[528776]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:5884: out of memory
Dec 04 16:28:24 osm-server ollama[528776]: current device: 0
Dec 04 16:28:25 osm-server ollama[528776]: 2023/12/04 16:28:25 llama.go:436: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:5884: out of memory
Dec 04 16:28:25 osm-server ollama[528776]: current device: 0
Dec 04 16:28:25 osm-server ollama[528776]: 2023/12/04 16:28:25 llama.go:510: llama runner stopped successfully
Dec 04 16:28:25 osm-server ollama[528776]: [GIN] 2023/12/04 - 16:28:25 | 200 |  6.468638351s |       127.0.0.1 | POST     "/api/generate"

@phalexo commented on GitHub (Dec 4, 2023):

Is there a download for the older version somewhere? I'd like to try it.


@technovangelist commented on GitHub (Dec 4, 2023):

How much ram does your machine have? You mentioned vram.


@phalexo commented on GitHub (Dec 4, 2023):

How much ram does your machine have? You mentioned vram.

I know you are asking the original poster, but I have 330GiB on the host and 12.2GiB per GPU (4 devices), and I am seeing something similar, even with a freshly rebuilt Ollama. Loading Mistral uses less than 50% of VRAM.

It runs ok on the host though.


@madsamjp commented on GitHub (Dec 4, 2023):

How much ram does your machine have? You mentioned vram.

Ample - 96GB. It's working now that I've reverted to version 0.1.11, which suggests to me that something in the latest update changed how Ollama allocates VRAM.


@madsamjp commented on GitHub (Dec 4, 2023):

Is there a download for the older version somewhere? I'd like to try it.

This is the version I've rolled back to - you can download from here: https://github.com/jmorganca/ollama/releases/tag/v0.1.11
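
For anyone else who wants to pin that release on Linux, something along these lines should work (a sketch, assuming the release page ships a standalone `ollama-linux-amd64` binary and the install script set up the usual `ollama` systemd service; adjust the destination to wherever your `ollama` binary actually lives):

```bash
# Stop the service, swap in the v0.1.11 binary, restart.
sudo systemctl stop ollama
sudo curl -L https://github.com/jmorganca/ollama/releases/download/v0.1.11/ollama-linux-amd64 \
  -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama
sudo systemctl start ollama
```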


@phalexo commented on GitHub (Dec 4, 2023):

Fantastic. Before dropping to 0.1.11 it was printing junk, and dying on the second inquiry. Now it seems to work. Quickly too.


@technovangelist commented on GitHub (Dec 5, 2023):

What OS are you running? How did you install it?


@madsamjp commented on GitHub (Dec 6, 2023):

What OS are you running? How did you install it?

Ubuntu 20.04
Using the install script:

curl https://ollama.ai/install.sh | sh

@technovangelist commented on GitHub (Dec 9, 2023):

Thanks for sharing this. We are looking into it. There is a release coming soon, 0.1.14, but I don't think a fix will be in there. Will let you know what we find. This is a bit strange.


@igorschlum commented on GitHub (Dec 10, 2023):

@madsamjp did you try with 0.1.14 that is out now?


@phalexo commented on GitHub (Dec 12, 2023):

@madsamjp did you try with 0.1.14 that is out now?

I have tried it with 0.1.14 as modified for Mixtral, and the error is back from the dead.

So, if I use version 0.1.11 I don't get the out of memory error, but I get another error specific to Mixtral. And if I use the modified version 0.1.14, the Mixtral error is gone, but I am back to the same cuBLAS fake OOM error.


@phalexo commented on GitHub (Dec 12, 2023):

@technovangelist Has anyone discovered anything new on this? Perhaps in other threads?

I have tried to use the Mixtral branch, derived from 0.1.14 (I assume), and the error is still there. The Mixtral Q6_K loads OK, but fails after I enter some text and start generation. It does the same with a model that is only 5GiB as well.


@madsamjp commented on GitHub (Dec 15, 2023):

@igorschlum @technovangelist I've now tried with version 0.1.16. The model loads into VRAM fine - takes 23688MiB, but dies when I give it a prompt. Seems only v0.1.13 (edit: v 0.1.11) works for this model.

Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: n_ctx      = 2048
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: freq_base  = 100000.0
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: freq_scale = 0.25
Dec 15 20:36:50 osm-server ollama[4101606]: llama_kv_cache_init: VRAM kv self = 496.00 MB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: KV self size  =  496.00 MiB, K (f16):  248.00 MiB, V (f16):  248.00 MiB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_build_graph: non-view tensors processed: 1306/1306
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: compute buffer total size = 273.32 MiB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: VRAM scratch buffer: 270.00 MiB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: total VRAM used: 22507.89 MiB (model: 21741.89 MiB, context: 766.00 MiB)

---

Dec 15 20:39:32 osm-server ollama[4129881]: {"timestamp":1702672772,"level":"INFO","function":"log_server_request","line":2596,"message":"request","remote_addr":"127.0.0.1","remote_port":41532,"status":200,"method":"HEAD","path":"/","params":{}}
Dec 15 20:39:32 osm-server ollama[4101606]: 2023/12/15 20:39:32 llama.go:577: loaded 0 images
Dec 15 20:39:32 osm-server ollama[4101606]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: out of memory
Dec 15 20:39:32 osm-server ollama[4101606]: current device: 0
Dec 15 20:39:32 osm-server ollama[4101606]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: !"CUDA error"
Dec 15 20:39:33 osm-server ollama[4101606]: 2023/12/15 20:39:33 llama.go:451: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: out of memory
Dec 15 20:39:33 osm-server ollama[4101606]: current device: 0
Dec 15 20:39:33 osm-server ollama[4101606]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: !"CUDA error"
Dec 15 20:39:33 osm-server ollama[4101606]: 2023/12/15 20:39:33 llama.go:525: llama runner stopped successfully


@phalexo commented on GitHub (Dec 15, 2023):

Did you mean to say 0.1.13 works? It is v0.1.11 that works for me. I copied the gguf folder from v0.1.11 to v0.1.12, recompiled, and it made v0.1.12 work. Between 11 and 12 something was broken. I was also able to get the same/very similar error with llama.cpp directly.


@igorschlum commented on GitHub (Dec 15, 2023):

@phalexo Ollama 0.1.15 is released. It's worth a try.


@phalexo commented on GitHub (Dec 15, 2023):

@madsamjp, I tried it unsuccessfully with the next version up, v0.1.16, so v0.1.15 cannot possibly work.


@madsamjp commented on GitHub (Dec 16, 2023):

Did you to say 0.1.13 works? It is the v0.1.11 that works for me. I copied gguf folder from v0.1.11 to v0.1.12, recompiled and it made v0.1.12 work. Between 11 and 12 something was broken. I was also able to get the same/very similar error with llama.cpp directly.


Apologies - I meant to say 0.1.11. I've tried 0.1.16 and I'm getting the OOM error, so I've had to revert back to 0.1.11.

Neither 0.1.13 nor 0.1.16 worked for me. Only 0.1.11 did.


@phalexo commented on GitHub (Dec 16, 2023):

git clone --recursive https://github.com/jmorganca/ollama.git
cd ollama/llm/llama.cpp
# Edit generate_linux.go so both CUDA cmake directives include -DLLAMA_CUDA_FORCE_MMQ=on:
vi generate_linux.go

//go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build ggml/build/cuda --target server --config Release
//go:generate mv ggml/build/cuda/bin/server ggml/build/cuda/bin/ollama-runner
//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build gguf/build/cuda --target server --config Release
//go:generate mv gguf/build/cuda/bin/server gguf/build/cuda/bin/ollama-runner

# Then rebuild from the repo root:
cd ../..
go generate ./...
go build .
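
If the build succeeds, the locally built binary can be used in place of the packaged one (a sketch; run the server from the repo root and talk to it from another shell):

```bash
# Start the freshly built server, then run the same model against it.
./ollama serve &
./ollama run deepseek-coder:33b-instruct-q5_K_S
```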

@madsamjp commented on GitHub (Dec 16, 2023):

@phalexo this works! It seems that adding the -DLLAMA_CUDA_FORCE_MMQ=on flag solves the issue for me.


@igorschlum commented on GitHub (Dec 16, 2023):

Hi @madsamjp

Great news. -DLLAMA_CUDA_FORCE_MMQ=on forces the usage of MMQ with the GPU even if GPU driver is not said to be compatible with CUDA; this parameter cannot be included in the Ollama Core and should stay as an optional parameter.

I think that you can close the Issue now. You could also try to update your GPU driver software and see if it's compatible with 0.1.15 version of Ollama.
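
If it helps, the installed driver version can be checked directly with nvidia-smi (a sketch):

```bash
# Report the GPU model and installed NVIDIA driver version.
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```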


@phalexo commented on GitHub (Dec 16, 2023):

How is the performance though? Is it impacted by the change?


@technovangelist commented on GitHub (Jan 3, 2024):

Can you try re-pulling the models being used? We updated most of them in the last few weeks to address issues like this.


@madsamjp commented on GitHub (Jan 3, 2024):

@technovangelist I've updated to the latest version of Ollama (0.1.17), and pulled the latest deepseek-coder:33b-instruct-q5_K_S model. Here is my modelfile:

FROM deepseek-coder:33b-instruct-q5_K_S

PARAMETER num_gpu 63
PARAMETER num_ctx 2048

I can load the model into VRAM just fine. It uses 23697MiB:
![image](https://github.com/jmorganca/ollama/assets/49611363/bb0ab4af-6b8a-47af-b889-50293b5c1c31)
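
The reported usage can be cross-checked against what the driver sees (a sketch):

```bash
# Show per-GPU memory usage as reported by the NVIDIA driver.
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
```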

The logs:

Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: ggml ctx size =    0.21 MiB
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: using CUDA for GPU acceleration
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: mem required  =  151.81 MiB
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: offloading 62 repeating layers to GPU
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: offloaded 63/63 layers to GPU
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: VRAM used: 21741.89 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: ....................................................................................................
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: n_ctx      = 2048
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: freq_base  = 100000.0
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: freq_scale = 0.25
Jan 03 18:06:36 osm-server ollama[2633395]: llama_kv_cache_init: VRAM kv self = 496.00 MB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: KV self size  =  496.00 MiB, K (f16):  248.00 MiB, V (f16):  248.00 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_build_graph: non-view tensors processed: 1306/1306
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: compute buffer total size = 273.19 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: VRAM scratch buffer: 270.00 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: total VRAM used: 22507.89 MiB (model: 21741.89 MiB, context: 766.00 MiB)

However, when I give it a prompt, it quickly dies:

Jan 03 18:06:36 osm-server ollama[2635863]: {"timestamp":1704305196,"level":"INFO","function":"main","line":3097,"message":"HTTP server listening","port":"58560","hostname":"127.0.0.1"}
Jan 03 18:06:36 osm-server ollama[2635863]: {"timestamp":1704305196,"level":"INFO","function":"log_server_request","line":2608,"message":"request","remote_addr":"127.0.0.1","remote_port":50516,"status":200,"method":"HEAD","path":"/","params":{}}
Jan 03 18:06:36 osm-server ollama[2633395]: 2024/01/03 18:06:36 llama.go:508: llama runner started in 4.000542 seconds
Jan 03 18:06:36 osm-server ollama[2633395]: [GIN] 2024/01/03 - 18:06:36 | 200 |  4.138764899s |       127.0.0.1 | POST     "/api/generate"
Jan 03 18:07:59 osm-server ollama[2633395]: 2024/01/03 18:07:59 llama.go:577: loaded 0 images
Jan 03 18:07:59 osm-server ollama[2635863]: {"timestamp":1704305279,"level":"INFO","function":"log_server_request","line":2608,"message":"request","remote_addr":"127.0.0.1","remote_port":54074,"status":200,"method":"HEAD","path":"/","params":{}}
Jan 03 18:07:59 osm-server ollama[2633395]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: out of memory
Jan 03 18:07:59 osm-server ollama[2633395]: current device: 0
Jan 03 18:07:59 osm-server ollama[2633395]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: !"CUDA error"
Jan 03 18:08:00 osm-server ollama[2633395]: 2024/01/03 18:08:00 llama.go:451: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: out of memory
Jan 03 18:08:00 osm-server ollama[2633395]: current device: 0
Jan 03 18:08:00 osm-server ollama[2633395]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: !"CUDA error"
Jan 03 18:08:00 osm-server ollama[2633395]: 2024/01/03 18:08:00 llama.go:525: llama runner stopped successfully

The only way I can continue to use this model is to build from source with the -DLLAMA_CUDA_FORCE_MMQ=on flag.


@phalexo commented on GitHub (Jan 3, 2024):

Did you ever test performance with the MMQ flag versus 0.1.11?


@madsamjp commented on GitHub (Jan 3, 2024):

@phalexo I haven't tested. Why, are you noticing degraded performance?


@phalexo commented on GitHub (Jan 3, 2024):

All my testing is ad hoc, so it's difficult to assess. I thought you might be running a largish system where it would be noticeable.

I have a suspicion that there may be a performance hit. If my understanding is correct, the flag shifts work away from cuBLAS to different kernels. If cuBLAS is better optimized, there may be a difference.


@madsamjp commented on GitHub (Jan 4, 2024):

@phalexo I'm just running a meager 4090! VRAM is a massive issue. I've found I can squeeze deepseek-coder 33b Q5_K_S into my VRAM if I reduce the context window to 2048, but it's right on the edge, using about 23.7GB. The model is really good AND fast at answering coding questions, which I find I'm relying on more and more these days for both my professional and personal work. If I have some time over the weekend I'll revert back to 0.1.11, test it, and report back here.
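
For a rough comparison between the MMQ build and 0.1.11, the CLI's verbose mode should print generation throughput (a sketch, assuming `--verbose` is available in both versions; the prompt is arbitrary):

```bash
# Prints timing stats, including the eval rate in tokens/s, after the response.
ollama run deepseek-coder:33b-instruct-q5_K_S --verbose \
  "Explain the difference between a mutex and a semaphore."
```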

Reference: github-starred/ollama#62762