[GH-ISSUE #1374] Out of memory error on model that previously worked fine after update to version 0.1.13 #47237

@phalexo commented on GitHub (Dec 4, 2023):

> How much ram does your machine have? You mentioned vram.

I know you are asking the original poster, but I have 330 GiB of RAM on the host and 12.2 GiB of VRAM per GPU (4 devices), and I am seeing something similar, even with a freshly rebuilt Ollama. Loading Mistral uses less than 50% of VRAM.

It runs ok on the host though.


@madsamjp commented on GitHub (Dec 4, 2023):

> How much ram does your machine have? You mentioned vram.

Ample: 96 GB. It's working now that I've reverted to version 0.1.11, which suggests to me that the latest update changed how Ollama allocates VRAM.


@madsamjp commented on GitHub (Dec 4, 2023):

> Is there a download for the older version somewhere? I'd like to try it.

This is the version I've rolled back to - you can download from here: https://github.com/jmorganca/ollama/releases/tag/v0.1.11


@phalexo commented on GitHub (Dec 4, 2023):

Fantastic. Before dropping to 0.1.11 it was printing junk, and dying on the second inquiry. Now it seems to work. Quickly too.


@technovangelist commented on GitHub (Dec 5, 2023):

What OS are you running? How did you install it?


@madsamjp commented on GitHub (Dec 6, 2023):

> What OS are you running? How did you install it?

Ubuntu 20.04
Using the install script:

curl https://ollama.ai/install.sh | sh


@technovangelist commented on GitHub (Dec 9, 2023):

Thanks for sharing this. We are looking into it. Release 0.1.14 is coming soon, but I don't think a fix will be in it. Will let you know what we find. This is a bit strange.


@igorschlum commented on GitHub (Dec 10, 2023):

@madsamjp did you try with 0.1.14 that is out now?


@phalexo commented on GitHub (Dec 12, 2023):

> @madsamjp did you try with 0.1.14 that is out now?

I have tried it with 0.1.14 as modified for Mixtral, and the error is back from the dead.

So, if I use version 0.1.11 I don't get the out-of-memory error, but I get another error specific to Mixtral. And if I use the modified version 0.1.14, the Mixtral error is gone, but I am back to the same cuBLAS fake OOM error.


@phalexo commented on GitHub (Dec 12, 2023):

@technovangelist Has anyone discovered anything new on this? Perhaps in other threads?

I have tried the Mixtral branch, derived from 0.1.14 (I assume), and the error is still there. The Mixtral Q6_K model loads OK but fails after I enter some text and start generation. The same happens with a model that is only 5 GiB.


@madsamjp commented on GitHub (Dec 15, 2023):

@igorschlum @technovangelist I've now tried with version 0.1.16. The model loads into VRAM fine - takes 23688MiB, but dies when I give it a prompt. Seems only ~~v0.1.13~~ (edit: v 0.1.11) works for this model.

Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: n_ctx      = 2048
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: freq_base  = 100000.0
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: freq_scale = 0.25
Dec 15 20:36:50 osm-server ollama[4101606]: llama_kv_cache_init: VRAM kv self = 496.00 MB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: KV self size  =  496.00 MiB, K (f16):  248.00 MiB, V (f16):  248.00 MiB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_build_graph: non-view tensors processed: 1306/1306
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: compute buffer total size = 273.32 MiB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: VRAM scratch buffer: 270.00 MiB
Dec 15 20:36:50 osm-server ollama[4101606]: llama_new_context_with_model: total VRAM used: 22507.89 MiB (model: 21741.89 MiB, context: 766.00 MiB)

---

Dec 15 20:39:32 osm-server ollama[4129881]: {"timestamp":1702672772,"level":"INFO","function":"log_server_request","line":2596,"message":"request","remote_addr":"127.0.0.1","remote_port":41532,"status":200,"method":"HEAD","path":"/","params":{}}
Dec 15 20:39:32 osm-server ollama[4101606]: 2023/12/15 20:39:32 llama.go:577: loaded 0 images
Dec 15 20:39:32 osm-server ollama[4101606]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: out of memory
Dec 15 20:39:32 osm-server ollama[4101606]: current device: 0
Dec 15 20:39:32 osm-server ollama[4101606]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: !"CUDA error"
Dec 15 20:39:33 osm-server ollama[4101606]: 2023/12/15 20:39:33 llama.go:451: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: out of memory
Dec 15 20:39:33 osm-server ollama[4101606]: current device: 0
Dec 15 20:39:33 osm-server ollama[4101606]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6589: !"CUDA error"
Dec 15 20:39:33 osm-server ollama[4101606]: 2023/12/15 20:39:33 llama.go:525: llama runner stopped successfully
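For anyone comparing their own journal output against the log above, here is a minimal Python sketch (not part of Ollama) that pulls the VRAM breakdown and the CUDA failure out of two of those lines. The function names and regular expressions are illustrative assumptions; only the two sample lines come from the log. CUDA error 2 is `cudaErrorMemoryAllocation`, so the failure is an allocation made at prompt time, after the weights and context were already resident.

```python
import re

# Two lines copied from the log above, with the syslog prefix removed.
LOAD_LINE = (
    "llama_new_context_with_model: total VRAM used: 22507.89 MiB "
    "(model: 21741.89 MiB, context: 766.00 MiB)"
)
ERROR_LINE = (
    "CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/"
    "llama.cpp/gguf/ggml-cuda.cu:6589: out of memory"
)

# Hypothetical patterns written for this thread's log format only.
VRAM_RE = re.compile(
    r"total VRAM used: (?P<total>[\d.]+) MiB "
    r"\(model: (?P<model>[\d.]+) MiB, context: (?P<context>[\d.]+) MiB\)"
)
CUDA_RE = re.compile(
    r"CUDA error (?P<code>\d+) at (?P<file>\S+):(?P<line>\d+): (?P<message>.+)"
)


def parse_vram(line):
    """Return the total/model/context figures in MiB, or None if absent."""
    match = VRAM_RE.search(line)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}


def parse_cuda_error(line):
    """Return the error code, source file, source line and message, or None."""
    match = CUDA_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"])
    fields["line"] = int(fields["line"])
    return fields


if __name__ == "__main__":
    vram = parse_vram(LOAD_LINE)
    error = parse_cuda_error(ERROR_LINE)
    # Difference between the 23688 MiB quoted in the comment and what
    # llama.cpp accounts for itself.
    print("unaccounted MiB:", round(23688 - vram["total"], 2))
    print("failed at:", error["file"], error["line"], error["message"])
```

The model and context figures add up to the reported total, so the roughly 1180 MiB gap to the 23688 MiB quoted above is memory that llama.cpp does not itemise in this log line.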


@phalexo commented on GitHub (Dec 15, 2023):

Did you mean to say 0.1.13 works? It is v0.1.11 that works for me. I copied the gguf folder from v0.1.11 to v0.1.12, recompiled, and that made v0.1.12 work. Something was broken between 11 and 12. I was also able to get the same or a very similar error with llama.cpp directly.


@igorschlum commented on GitHub (Dec 15, 2023):

@phalexo Ollama 0.1.15 is released. It's worth a try.


@phalexo commented on GitHub (Dec 15, 2023):

@madsamjp tried it unsuccessfully with the next version up, v0.1.16, so v0.1.15 cannot possibly work.


@madsamjp commented on GitHub (Dec 16, 2023):

> Did you to say 0.1.13 works? It is the v0.1.11 that works for me. I copied gguf folder from v0.1.11 to v0.1.12, recompiled and it made v0.1.12 work. Between 11 and 12 something was broken. I was also able to get the same/very similar error with llama.cpp directly.

Apologies, I meant to say 0.1.11. I've tried 0.1.16 and I'm getting the OOM error. I've had to revert to 0.1.11.

Neither 0.1.13 nor 0.1.16 worked for me. Only 0.1.11.


@phalexo commented on GitHub (Dec 16, 2023):

git clone --recursive https://github.com/jmorganca/ollama.git
cd ollama/llm/llama.cpp
vi generate_linux.go

//go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build ggml/build/cuda --target server --config Release
//go:generate mv ggml/build/cuda/bin/server ggml/build/cuda/bin/ollama-runner
//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build gguf/build/cuda --target server --config Release
//go:generate mv gguf/build/cuda/bin/server gguf/build/cuda/bin/ollama-runner

cd ../..
go generate ./...
go build .
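The manual `vi` edit above can also be scripted. The following is a minimal Python sketch, assuming the CUDA configure directives in `generate_linux.go` have the shape shown in this comment. The helper name `add_flag` and the shortened sample text are hypothetical; only the flag and the directive shape come from the thread.

```python
FLAG = "-DLLAMA_CUDA_FORCE_MMQ=on"


def add_flag(source, flag=FLAG):
    """Append `flag` to each CUDA cmake configure directive that lacks it.

    Only lines of the form `//go:generate cmake -S <dir> -B <dir>/build/cuda ...`
    are touched; `cmake --build` and `mv` directives are left alone.
    """
    patched = []
    for line in source.splitlines():
        is_configure = (
            line.startswith("//go:generate cmake -S ")
            and "/build/cuda" in line
        )
        if is_configure and flag not in line.split():
            line = line.rstrip() + " " + flag
        patched.append(line)
    trailing_newline = "\n" if source.endswith("\n") else ""
    return "\n".join(patched) + trailing_newline


# Shortened, hypothetical excerpt in the same shape as the directives above.
SAMPLE = (
    "//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on\n"
    "//go:generate cmake --build gguf/build/cuda --target server --config Release\n"
)

if __name__ == "__main__":
    print(add_flag(SAMPLE), end="")
```

To apply it to a real checkout, read `llm/llama.cpp/generate_linux.go`, pass its text through `add_flag`, write the result back, then run `go generate ./...` and `go build .` as above. Running it twice is harmless because lines that already carry the flag are skipped.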


@madsamjp commented on GitHub (Dec 16, 2023):

@phalexo this works! It seems that adding the `-DLLAMA_CUDA_FORCE_MMQ=on` flag solves the issue for me.


@igorschlum commented on GitHub (Dec 16, 2023):

Hi @madsamjp

Great news. `-DLLAMA_CUDA_FORCE_MMQ=on` forces the use of MMQ on the GPU even if the GPU driver is not reported to be compatible with CUDA; this parameter cannot be included in Ollama core and should remain an optional parameter.

I think you can close the issue now. You could also try updating your GPU driver and see whether it is compatible with version 0.1.15 of Ollama.


@phalexo commented on GitHub (Dec 16, 2023):

How is the performance though? Is it impacted by the change?


@technovangelist commented on GitHub (Jan 3, 2024):

Can you try re-pulling the models being used? We updated most of them in the last few weeks to address issues like this.


@madsamjp commented on GitHub (Jan 3, 2024):

@technovangelist I've updated to the latest version of Ollama (0.1.17), and pulled the latest `deepseek-coder:33b-instruct-q5_K_S` model. Here is my modelfile:

FROM deepseek-coder:33b-instruct-q5_K_S

PARAMETER num_gpu 63
PARAMETER num_ctx 2048

I can load the model into VRAM just fine. It uses `23697MiB`.

The logs:

Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: ggml ctx size =    0.21 MiB
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: using CUDA for GPU acceleration
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: mem required  =  151.81 MiB
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: offloading 62 repeating layers to GPU
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: offloading non-repeating layers to GPU
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: offloaded 63/63 layers to GPU
Jan 03 18:06:33 osm-server ollama[2633395]: llm_load_tensors: VRAM used: 21741.89 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: ....................................................................................................
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: n_ctx      = 2048
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: freq_base  = 100000.0
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: freq_scale = 0.25
Jan 03 18:06:36 osm-server ollama[2633395]: llama_kv_cache_init: VRAM kv self = 496.00 MB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: KV self size  =  496.00 MiB, K (f16):  248.00 MiB, V (f16):  248.00 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_build_graph: non-view tensors processed: 1306/1306
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: compute buffer total size = 273.19 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: VRAM scratch buffer: 270.00 MiB
Jan 03 18:06:36 osm-server ollama[2633395]: llama_new_context_with_model: total VRAM used: 22507.89 MiB (model: 21741.89 MiB, context: 766.00 MiB)

However, when I give it a prompt, it quickly dies:

Jan 03 18:06:36 osm-server ollama[2635863]: {"timestamp":1704305196,"level":"INFO","function":"main","line":3097,"message":"HTTP server listening","port":"58560","hostname":"127.0.0.1"}
Jan 03 18:06:36 osm-server ollama[2635863]: {"timestamp":1704305196,"level":"INFO","function":"log_server_request","line":2608,"message":"request","remote_addr":"127.0.0.1","remote_port":50516,"status":200,"method":"HEAD","path":"/","params":{}}
Jan 03 18:06:36 osm-server ollama[2633395]: 2024/01/03 18:06:36 llama.go:508: llama runner started in 4.000542 seconds
Jan 03 18:06:36 osm-server ollama[2633395]: [GIN] 2024/01/03 - 18:06:36 | 200 |  4.138764899s |       127.0.0.1 | POST     "/api/generate"
Jan 03 18:07:59 osm-server ollama[2633395]: 2024/01/03 18:07:59 llama.go:577: loaded 0 images
Jan 03 18:07:59 osm-server ollama[2635863]: {"timestamp":1704305279,"level":"INFO","function":"log_server_request","line":2608,"message":"request","remote_addr":"127.0.0.1","remote_port":54074,"status":200,"method":"HEAD","path":"/","params":{}}
Jan 03 18:07:59 osm-server ollama[2633395]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: out of memory
Jan 03 18:07:59 osm-server ollama[2633395]: current device: 0
Jan 03 18:07:59 osm-server ollama[2633395]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: !"CUDA error"
Jan 03 18:08:00 osm-server ollama[2633395]: 2024/01/03 18:08:00 llama.go:451: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: out of memory
Jan 03 18:08:00 osm-server ollama[2633395]: current device: 0
Jan 03 18:08:00 osm-server ollama[2633395]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:6600: !"CUDA error"
Jan 03 18:08:00 osm-server ollama[2633395]: 2024/01/03 18:08:00 llama.go:525: llama runner stopped successfully

The only way I can continue to use this model is to build from source with the `-DLLAMA_CUDA_FORCE_MMQ=on` flag.
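As a sanity check on the load log above, the reported figures can be reproduced with simple arithmetic. This is a sketch under stated assumptions, not llama.cpp's own accounting code: `n_ctx = 2048` and the 62 repeating layers come from the Modelfile and the log, while the per-layer K/V width of 1024 is an assumption that does not appear in the log and was chosen because it reproduces the logged 248 MiB exactly.

```python
def kv_half_mib(n_ctx, n_layer, n_embd_kv, bytes_per_elem=2):
    """Size in MiB of the K cache alone (the V cache has the same size).

    One vector of n_embd_kv f16 values (2 bytes each) is stored per layer
    for every position in the context window.
    """
    return n_ctx * n_layer * n_embd_kv * bytes_per_elem / (1024 ** 2)


N_CTX = 2048       # PARAMETER num_ctx 2048 / "n_ctx = 2048" in the log
N_LAYER = 62       # "offloading 62 repeating layers to GPU"
N_EMBD_KV = 1024   # ASSUMPTION: not in the log, picked to match 248.00 MiB

SCRATCH_MIB = 270.00     # "VRAM scratch buffer: 270.00 MiB"
MODEL_MIB = 21741.89     # "llm_load_tensors: VRAM used: 21741.89 MiB"

k_mib = kv_half_mib(N_CTX, N_LAYER, N_EMBD_KV)   # K (f16)
kv_mib = 2 * k_mib                               # "KV self size"
context_mib = kv_mib + SCRATCH_MIB               # "context" in the total line
total_mib = MODEL_MIB + context_mib              # "total VRAM used"

if __name__ == "__main__":
    print(f"K: {k_mib:.2f} MiB, KV: {kv_mib:.2f} MiB")
    print(f"context: {context_mib:.2f} MiB, total: {total_mib:.2f} MiB")
```

Under this formula the KV cache grows linearly with `num_ctx`, so the context share is small next to the weights here. The numbers say nothing about temporary buffers allocated once a prompt is processed, which is where the log shows the failure.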
