[GH-ISSUE #9957] [ ROCm error: out of memory ] Runner Terminated: num_ctx within model / hardware limits reliably crashes #68575

Closed
opened 2026-05-04 14:28:31 -05:00 by GiteaMirror · 41 comments

Originally created by @digitalextremist on GitHub (Mar 23, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9957

tl;dr -- Rather than working within the combined VRAM+RAM, a `num_ctx` within hardware and model limits causes a crash at an unknown point on AMD.


This is using the `:rocm` tag of the `ollama` image on `docker`

Host system has `64GB` RAM and GPU has `16GB` VRAM

`ollama ps` with a test case looks like this:

![Image](https://github.com/user-attachments/assets/09e08725-6578-42c2-8aca-ce3e0424d09f)

That is using [`sikamikanikobg/OlympicCoder-7B`](https://ollama.com/sikamikanikobg/OlympicCoder-7B), which is FP16

The model is based on `qwen2` and being run with `32768` for `num_ctx`

I can cause this with many other models, including `gemma3`, but I am focusing on the last instance of the runner crash.

I have included a gist starting well before the crash, and likely shared more than necessary in it, since this is probably a known issue.

I was not able to discern which known issue, though, when searching through open issues.

It does not seem to truly split across both VRAM and system RAM as it pertains to `num_ctx`.

This is using the F/OSS driver; the proprietary driver seems to be meh, so I am avoiding it.
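
For reference, a minimal reproduction sketch of the setup described above; the container flags follow the standard ROCm docker instructions, and the prompt is just a placeholder:

```shell
# Standard ROCm container setup (per ollama's docker docs)
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm

# Request with num_ctx raised to 32768 -- the setting that triggers the crash
curl http://localhost:11434/api/generate -d '{
  "model": "sikamikanikobg/OlympicCoder-7B",
  "prompt": "Write a quicksort in C.",
  "options": { "num_ctx": 32768 }
}'
```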

Relevant log output

Stupid-long version of log: https://gist.github.com/digitalextremist/b59e033c67d28a13f6ab7689131e98e4

Point where it initially hits the fan:

```shell
zero_llm_ollama  | ROCm error: out of memory
zero_llm_ollama  |   current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:366
zero_llm_ollama  |   ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
zero_llm_ollama  | //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:73: ROCm error
zero_llm_ollama  | Memory critical error by agent node-0 (Agent handle: 0x648878515ea0) on address 0x7b8dd8300000. Reason: Memory in use. 
zero_llm_ollama  | SIGABRT: abort
zero_llm_ollama  | PC=0x7b8e42fba00b m=11 sigcode=18446744073709551610
zero_llm_ollama  | signal arrived during cgo execution
```

OS

Docker

GPU

AMD

CPU

Intel

Ollama version

0.6.2

GiteaMirror added the bug label 2026-05-04 14:28:31 -05:00

@rick-github commented on GitHub (Mar 23, 2025):

```shell
zero_llm_ollama  | time=2025-03-23T21:52:07.578Z level=INFO source=server.go:138 msg=offload library=rocm
 layers.requested=-1 layers.model=29 layers.offload=25 layers.split="" memory.available="[15.8 GiB]"
 memory.gpu_overhead="0 B" memory.required.full="18.1 GiB" memory.required.partial="15.6 GiB"
 memory.required.kv="1.8 GiB" memory.required.allocations="[15.6 GiB]" memory.weights.total="12.2 GiB"
 memory.weights.repeating="12.2 GiB" memory.weights.nonrepeating="1.0 GiB" memory.graph.full="1.8 GiB"
 memory.graph.partial="2.3 GiB"
```

ollama estimated it needed 15.6G to offload 25 of 29 layers, with 15.8G available on the device. It looks like it OOM'ed on the first completion call, so the runner tried to allocate something > 200M. ollama estimations are sometimes off; see [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288) for some mitigations. With the change in the ollama runner architecture in the 0.6 series there's likely some work to be done on the estimation logic.
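
The mitigations in the linked comment are along these lines (a sketch; the overhead value is an arbitrary example, and `num_gpu`/`num_ctx` are standard request options):

```shell
# Reserve VRAM headroom the estimator doesn't account for (value in bytes)
OLLAMA_GPU_OVERHEAD=536870912 ollama serve

# Or shrink the context and/or offload fewer layers per request
curl http://localhost:11434/api/generate -d '{
  "model": "sikamikanikobg/OlympicCoder-7B",
  "prompt": "hello",
  "options": { "num_ctx": 16384, "num_gpu": 20 }
}'
```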


@digitalextremist commented on GitHub (Mar 23, 2025):

Thanks a lot @rick-github for the quick diagnosis with mitigation tips; will try those.


@digitalextremist commented on GitHub (Mar 26, 2025):

@jessegross would this also be fixed by the solution in f66216e399 resolving #9890?


@jessegross commented on GitHub (Mar 26, 2025):

> @jessegross would this also be fixed by the solution in f66216e resolving #9890?

Probably not, that commit doesn't have any effect on qwen2.


@digitalextremist commented on GitHub (Mar 26, 2025):

Thanks for the quick follow-up @jessegross!


@rtaic-coder commented on GitHub (Apr 9, 2025):

I have a similar crash with two AMD GPUs (44GB) and 64GB of CPU memory, using Gemma 3 and DeepSeek-R1-32B quantized on `ollama:0.6.5-rocm` running in Docker. `ollama ps` shows the model loaded 100% in GPU, using only 23GB. But it crashed when a completion request with about 81k characters was sent:

```shell
ollama_amd | time=2025-04-09T20:30:25.470Z level=DEBUG source=process_text_spm.go:184 msg="adding bos token to prompt" id=2
ollama_amd | time=2025-04-09T20:30:25.471Z level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=19357 used=0 remaining=19357
ollama_amd | ROCm error: out of memory
ollama_amd | current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:366
ollama_amd | ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
ollama_amd | //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:73: ROCm error
ollama_amd | Memory critical error by agent node-0 (Agent handle: 0x7e3b74665c70) on address 0x7e3b94300000. Reason: Memory in use.
ollama_amd | SIGABRT: abort
ollama_amd | PC=0x7e3c4d55c00b m=27 sigcode=18446744073709551610
ollama_amd | signal arrived during cgo execution
```


@digitalextremist commented on GitHub (Apr 9, 2025):

Thanks for confirming @rtaic-coder

I am still reproducing this with many different models.

It seems memory estimation for model + context + compute is still unreliable.
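
As a rough sanity check, the KV cache portion of that estimate can be reproduced by hand. A sketch below, assuming qwen2-7B-class shape parameters (28 layers, 4 KV heads under GQA, head dim 128); it matches the `memory.required.kv="1.8 GiB"` figure in the offload log earlier in the thread:

```shell
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * num_ctx * bytes/elem
N_LAYERS=28; N_KV_HEADS=4; HEAD_DIM=128; NUM_CTX=32768; BYTES_F16=2
KV_BYTES=$(( 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * NUM_CTX * BYTES_F16 ))
echo "$(( KV_BYTES / 1024 / 1024 )) MiB"   # 1792 MiB, i.e. ~1.8 GiB
```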


@Anaphylaxis commented on GitHub (May 3, 2025):

I experience similar behavior with a 7900 XTX and 80GB RAM. The available GPU RAM goes down and down with each model load, which happens constantly since models get unloaded to make room for the embedding model. It is unstable, and I have to kill Ollama and restart the process to clear the VRAM so I can use 100% GPU again.
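
A possible stopgap instead of killing the process, assuming a recent ollama build with the `stop` subcommand: unload models explicitly so their VRAM is returned.

```shell
# Unload a specific model and free its VRAM without restarting the server
ollama stop qwen3:32b

# Or via the API: keep_alive=0 unloads the model right after the request
curl http://localhost:11434/api/generate -d '{"model": "qwen3:32b", "keep_alive": 0}'
```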


@digitalextremist commented on GitHub (May 3, 2025):

Thanks for confirming that also @Anaphylaxis

@ParthSareen dropped by `discord` today and said he nudged this issue along among the maintainers, so hopefully those of us watching this issue will see a fix come up soon.

I will also be watching for the additional aspect you describe; you make it sound more like a leak than just inconsistent memory estimation. Will look at that aspect too.


@Anaphylaxis commented on GitHub (May 4, 2025):

For example, 3 tok/s vs 25 tok/s if I don't kill the process every so often:

> PS C:\Users\user> ollama ps          
> NAME    ID    SIZE    PROCESSOR    UNTIL 
> PS C:\Users\user> ollama run qwen3:32b    
> >>> /set verbose
> Set 'verbose' mode.
> >>> Hello, how are you?
> <think>
> Let me analyze this simple greeting. Hello is a standard opening that 
> shows politeness and respect. It's a good opportunity to demonstrate 
> warmth and approachability in our conversation. I should respond in a
> friendly and engaging way that invites further discussion.
> 
> I need to consider the user's perspective - they might be testing the
> system, or they might want to have a genuine conversation. Either way, I        
> should remain consistent in my response style, showing empathy and
> understanding.
> 
> The greeting "Hello, how are you?" is simple but effective. It's a 
> standard way to start a conversation that shows consideration for the 
> other person. I should acknowledge their greeting in a similar way, while       
> also showing my own personality.
> 
> I can add a bit of personality to my response by including a friendly 
> emoji to show warmth. This makes the conversation feel more natural and
> personable. I should also make sure to ask them back how they're doing,
> creating a natural flow to the conversation.
> 
> I should be mindful of keeping my response concise but not too brief. The       
> user might be looking for a conversation partner, so I want to be engaging      
> but not overwhelming. I should stay friendly and approachable in my tone.
> 
> The user might be looking for different things in this conversation - 
> maybe just a friendly chat, or perhaps they have questions or need help
> with something. I should be prepared to adapt to whatever direction the
> conversation takes.
> 
> I need to make sure my response is welcoming and open-ended. That way, the      
> user feels comfortable to talk about whatever they want. I should avoid         
> any assumptions about what they're looking for, and instead keep my
> response flexible.
> 
> The user might also be testing how I respond to simple greetings, so I 
> need to make sure my response is appropriate but not overly formal. I want      
> to show that I can have natural, friendly conversations while still being       
> professional and helpful.
> </think>
> 
> Hello! I'm doing well, thank you for asking. I'm always excited to chat         
> and help out. How have you been? I'd love to hear what's on your mind! 😊
> 
> total duration:       2m15.524901s
> load duration:        18.1326ms
> prompt eval count:    14 token(s)
> prompt eval duration: 3.1519391s
> prompt eval rate:     4.44 tokens/s
> eval count:           416 token(s)
> eval duration:        2m12.3537658s
> eval rate:            3.14 tokens/s
> >>> /bye
> PS C:\Users\user> ollama ps           
> NAME         ID              SIZE     PROCESSOR          UNTIL               
> qwen3:32b    e1c9f234c6eb    22 GB    37%/63% CPU/GPU    29 minutes from now    
> PS C:\Users\user> Stop-Process -name "ollama" && ollama` app
> PS C:\Users\user> ollama run qwen3:32b                      
> >>> /set verbose
> Set 'verbose' mode.
> >>> Hello, how are you?
> <think>
> Alright, the user just greeted me and asked how I am. I need to respond in      
> a friendly and welcoming way. Let me make sure I acknowledge their 
> greeting and offer a positive response.
> 
> First, I should thank them for asking. It's a nice touch to show
> appreciation. Then, I can mention that I'm doing well, which keeps the
> conversation upbeat. Including an emoji like a smiley face adds a friendly      
> tone.
> 
> Next, I should invite them to share how they're feeling. This opens the
> door for a more personal conversation and shows I'm interested in their
> well-being. Ending with an offer to help with any questions or topics 
> keeps the interaction helpful and supportive.
> 
> I need to check the flow to ensure it's natural and not too robotic. Also,      
> keeping the response concise but warm is key. Let me put it all together        
> and make sure it sounds genuine.
> </think>
> 
> Hi there! Thank you for asking - I'm doing well, just happy to be here! 😊      
> How are *you* feeling today? I'd love to hear what's on your mind or help       
> with any questions you might have. What would you like to chat about?
> 
> total duration:       9.6188248s
> load duration:        17.9104ms
> prompt eval count:    14 token(s)
> prompt eval duration: 118.3738ms
> prompt eval rate:     118.27 tokens/s
> eval count:           237 token(s)
> eval duration:        9.4708065s
> eval rate:            25.02 tokens/s



@digitalextremist commented on GitHub (May 4, 2025):

Curious to see what happens with 0.6.8 based on the prerelease changelog!

Will check that @Anaphylaxis; did have some issues like that but they were resolved in 0.6.1


@andrew-aladjev commented on GitHub (Jul 22, 2025):

Same issue is still here:

  • 64 GB RAM
  • 16 GB VRAM (7600 XT)
  • rocm 6.4.2.60402-120~24.04
  • ollama 0.9.6
  • Qwen 3 32B (num_ctx 32k)
  • Actual context size is 21k

log:

```shell
Jul 22 05:09:21 puchuu-home-2 ollama[2188]: ROCm error: out of memory
Jul 22 05:09:21 puchuu-home-2 ollama[2188]:   current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:372
Jul 22 05:09:21 puchuu-home-2 ollama[2188]:   ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
Jul 22 05:09:21 puchuu-home-2 ollama[2188]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:76: ROCm error
Jul 22 05:09:21 puchuu-home-2 ollama[2188]: Memory critical error by agent node-0 (Agent handle: 0x645e052700b0) on address 0x7636d0200000. Reason: Memory in use.
Jul 22 05:09:21 puchuu-home-2 ollama[2188]: SIGABRT: abort
```

@digitalextremist it looks like nothing changed.


@rick-github commented on GitHub (Jul 22, 2025):

A full server log would help with debugging, but based on ROCm error: out of memory, see https://github.com/ollama/ollama/issues/9957#issuecomment-2746512857
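
For reference, a sketch of how to capture such a log on a systemd install (Docker users can substitute `docker logs ollama`); `OLLAMA_DEBUG=1` is the documented way to raise verbosity:

```shell
# Raise log verbosity, restart, reproduce the crash, then dump the log
sudo systemctl edit ollama    # add: Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama --no-pager > ollama-server.log
```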


@PedroHLC commented on GitHub (Aug 15, 2025):

It shouldn't crash when a 16368 MiB GPU runs a 14GB model, right? The new memory estimates seem to be in place according to the logs:

```shell
time=2025-08-15T10:44:55.096-03:00 level=INFO source=server.go:166 msg="enabling new memory estimates"
time=2025-08-15T10:44:55.097-03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/bin/.ollama-wrapped runner --ollama-engine --model /var/lib/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 40251"
time=2025-08-15T10:44:55.097-03:00 level=INFO source=server.go:657 msg="loading model" "model layers"=25 requested=-1
time=2025-08-15T10:44:55.098-03:00 level=INFO source=server.go:663 msg="system memory" total="62.7 GiB" free="41.1 GiB" free_swap="65.0 GiB"
time=2025-08-15T10:44:55.098-03:00 level=INFO source=server.go:667 msg="gpu memory" id=GPU-11883c28f37e073e available="15.2 GiB" free="15.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-15T10:44:55.103-03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
time=2025-08-15T10:44:55.103-03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:40251"
time=2025-08-15T10:44:55.108-03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:12 GPULayers:25[ID:GPU-11883c28f37e073e Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-15T10:44:55.146-03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32, ID: GPU-11883c28f37e073e
load_backend: loaded ROCm backend from /nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so
load_backend: loaded CPU backend from /nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-cpu-haswell.so
time=2025-08-15T10:44:57.855-03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-15T10:44:58.012-03:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:12 GPULayers:25[ID:GPU-11883c28f37e073e Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:12 GPULayers:25[ID:GPU-11883c28f37e073e Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=ggml.go:486 msg="offloading 24 repeating layers to GPU"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=ggml.go:497 msg="offloaded 25/25 layers to GPU"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:321 msg="kv cache" device=ROCm0 size="492.0 MiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:332 msg="compute graph" device=ROCm0 size="2.1 GiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:342 msg="total memory" size="15.4 GiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-08-15T10:44:58.058-03:00 level=INFO source=server.go:1232 msg="waiting for llama runner to start responding"
time=2025-08-15T10:44:58.059-03:00 level=INFO source=server.go:1266 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-15T10:45:01.080-03:00 level=INFO source=server.go:1270 msg="llama runner started in 5.98 seconds"
ROCm error: out of memory
  current device: 0, in function alloc at /build/source/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:424
  ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
/build/source/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: ROCm error
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-base.so(+0x2fc78) [0x7f3d302ebc78]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-base.so(ggml_print_backtrace+0x231) [0x7f3d302ebc51]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-base.so(ggml_abort+0x109) [0x7f3d302ebda9]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x106b52) [0x7f3cf8a97b52]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x116432) [0x7f3cf8aa7432]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x10f406) [0x7f3cf8aa0406]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x10eb42) [0x7f3cf8a9fb42]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x10cd7c) [0x7f3cf8a9dd7c]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/bin/.ollama-wrapped() [0x1c5f16b]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/bin/.ollama-wrapped() [0x1c0c96b]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/bin/.ollama-wrapped() [0xf3d464]
SIGABRT: abort
PC=0x7f3d79a6af3c m=16 sigcode=18446744073709551610
signal arrived during cgo execution
goroutine 51 gp=0xc000582e00 m=16 mp=0xc000101808
```

Used v0.11.5-rc2 and `gpt-oss:20b`. GPU is an RX 6800 with ROCm 6.3.3.

EDIT: I can do small talk with the model 100% on GPU; it breaks with anything more complex.

EDIT2: Ah okay, it works if I limit myself to ~7104 tokens. (But I would love to mix usage with RAM and increase this.)
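
For anyone wanting to pin that ceiling down explicitly rather than rediscover it per prompt, a sketch (the 7104 figure is just the empirical limit from the comment above):

```shell
# Server-wide default context length (env var)
OLLAMA_CONTEXT_LENGTH=7104 ollama serve

# Or per session in the CLI
ollama run gpt-oss:20b
>>> /set parameter num_ctx 7104
```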


@jessegross commented on GitHub (Aug 15, 2025):

@PedroHLC Can you post the full log?


@PedroHLC commented on GitHub (Aug 17, 2025):

> @PedroHLC Can you post the full log?

[ollama-rocm-oom.log](https://github.com/user-attachments/files/21823742/ollama-rocm-oom.log)


@PedroHLC commented on GitHub (Aug 29, 2025):

@jessegross I'm unable to reproduce this error any more with v0.11.8-rc0.


@jessegross commented on GitHub (Aug 29, 2025):

@PedroHLC Flash attention was turned on by default for gpt-oss in 0.11.8, which reduces the size of those allocations.


@digitalextremist commented on GitHub (Aug 29, 2025):

For the record, I am no longer experiencing this issue since 0.11.6 or so, possibly earlier.

Right now gpt-oss:20b and qwen3-coder:30b have both performed at the maximum possible context, reliably, for 18-hour stretches in code sessions for days on end. I would call this one resolved on my side @jessegross.

That memory estimation system you dropped on us changed the game it feels like!

Will keep open until @PedroHLC verifies he is good to go also.


@ShaunaGordon commented on GitHub (Sep 3, 2025):

I just updated to 0.11.8 and am still getting this error. ☹️


@digitalextremist commented on GitHub (Sep 4, 2025):

Sorry to hear that! Curious @ShaunaGordon:

  • Which GPU are you using?
  • Are you using a docker image?
  • What model do you see this most with?
  • What is your num_ctx setting on the request?
  • Are you using flash attention and/or kv cache? (see the sketch after this list)
  • Are you using the new memory estimation approach?
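
A sketch of the flash attention / KV cache settings referenced above: both are opt-in environment variables in current builds, and KV cache quantization only takes effect when flash attention is enabled.

```shell
# Enable flash attention and quantize the KV cache to q8_0
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```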

@ShaunaGordon commented on GitHub (Sep 4, 2025):

It seems to be intermittent now, after I happened to notice my autocomplete (cogito:3b) started working again.

  • Radeon 7900XTX 24gb vram (Ryzen 9 9950X3D, 64gb ram)
  • No docker images, ollama-rocm straight from Arch repos on the host system
  • I was getting it with my cogito:8b, initially, but now my smaller models seem to be okay, so I tried with larger models, command-r and qwen3 30b. Oddly, the larger models clock in at 30gb and 26gb respectively when started from Continue, despite only being 20gb according to their respective pages. Both models trigger the OOM error from ollama, but only when Continue calls them. Nothing else reports using the remaining system resources. System ram never seems to break 35gb usage, and the CPU barely has to try. Continue is obviously doing something to push the models' usage up (something I'll talk to them about as +50% seems excessive to me), but even so, there's no apparent reason for ollama to go OOM. When I try models that split with a similar percentage in the ollama CLI or in the Newelle client (such as Mixtral 8x7b, as both command-r and qwen fit when run from these clients), I don't get OOM errors, though Mixtral only shows 4096 for context length.
  • I've left num_ctx as default for both ollama and Continue (the client I'm working through). ollama ps is reporting 32768 for the context size for all tests from Continue (which is way low for command-r and cogito (128k), and qwen3:30b (256k) and way high for command-r7b (8k), though ollama appears to properly detect the lengths set by the models). I want to say Continue uses that as its default to allow space for most models, since it sets a max somewhere and lets the user set up to that max, but in this case, it's coming in way below the big models, so I wouldn't expect ollama to really care.
  • I'm not setting those and it doesn't look like Continue is, either. So it's whatever ollama defaults to.
  • Is there a particular flag for it? I'm willing to try if I need to explicitly activate it. I just assumed 0.11.8 switched to it by default when it was released.

@digitalextremist commented on GitHub (Sep 4, 2025):

Thanks for the details @ShaunaGordon

Increased size sounds like context length, probably. And those context numbers sound high, even for 24GB. That is when I started to feel the issue before: the context length would technically fit, but because the estimate of what was needed in reality was off, it would crash.

Are you using the new memory management system #11090? And flash attention, and/or KV cache?

I am using `Zed` myself. That has a tightly configured `num_ctx` which is clear to see, so I'm not sure about other IDEs. But the numbers you are seeing in `ps` seem high. Curious how `gpt-oss:20b` looks for you, since so much emphasis went into optimizing for that, as with `gemma3` ... you might also try the `qat` tags on those models just to see how it goes.


@jessegross commented on GitHub (Sep 4, 2025):

@ShaunaGordon The new memory management isn't on by default yet - you need to set the environment variable OLLAMA_NEW_ESTIMATES=1. However, it's also only supported in the new engine and most of the models you listed are only available in the old engine. gpt-oss would be new engine.
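
A sketch of turning that on, for either a bare-metal or Docker install (container flags as in the standard ROCm instructions):

```shell
# Bare metal
OLLAMA_NEW_ESTIMATES=1 ollama serve

# Docker (ROCm image)
docker run -d -e OLLAMA_NEW_ESTIMATES=1 \
  --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm
```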


@ShaunaGordon commented on GitHub (Sep 8, 2025):

I just got an OOM error with qwen2.5-coder:14b for some reason. It's coming in at 18gb even through Continue, so even with the bloating, it fits well within the GPU (and ollama allocates accordingly). There's no evident reason it should be erroring.

gpt-oss is reporting 14gb, so that seems to be working. It also so far isn't triggering OOM errors, so that's good.


@andrew-aladjev commented on GitHub (Sep 8, 2025):

`ollama` is based on `llama.cpp`. In my experience it is completely impossible to predict memory allocation in `llama.cpp` for a selected backend. You may predict the memory amount for `llama.cpp`, but afterwards you will receive a segfault (not enough VRAM) from the backend. So I just dropped this idea and switched to a binary search algorithm:

```bash
#!/bin/bash
set -e

# Setup INT handler.
trap 'exit 130' INT

...

# Prepare HF repository name.
HF_REPOSITORY_NAME='unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL'

# Prepare token counts.
PROMPT_TOKEN_COUNT=<TODO>
MAX_COT_TOKEN_COUNT=$(( 2 ** 12 ))
MAX_OUTPUT_TOKEN_COUNT=$(( 2 ** 13 ))

# Prepare GPU layer counts.
MIN_GPU_LAYER_COUNT=0
MAX_GPU_LAYER_COUNT=49

# Prepare additional options.
ADDITIONAL_OPTIONS=('--override-tensor' '.ffn_.*_exps.=CPU')

# Prepare max token count.
MAX_TOKEN_COUNT=$(( $PROMPT_TOKEN_COUNT + $MAX_COT_TOKEN_COUNT + $MAX_OUTPUT_TOKEN_COUNT ))
echo "Max token count: ${MAX_TOKEN_COUNT}"

# Find GPU layer counts.
GPU_LAYER_COUNT=$MAX_GPU_LAYER_COUNT
BEST_GPU_LAYER_COUNT=$MIN_GPU_LAYER_COUNT

while [ $MIN_GPU_LAYER_COUNT -le $MAX_GPU_LAYER_COUNT ]; do
  RESULT=0
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --ctx-size "$MAX_TOKEN_COUNT" \
    --n-gpu-layers "$GPU_LAYER_COUNT" \
    --no-warmup \
    "${ADDITIONAL_OPTIONS[@]}" > /dev/null 2>&1 || \
    RESULT=$?

  if [ $RESULT -eq 0 ]; then
    BEST_GPU_LAYER_COUNT=$GPU_LAYER_COUNT
    echo "Best GPU layer count: $BEST_GPU_LAYER_COUNT"

    MIN_GPU_LAYER_COUNT=$(( $GPU_LAYER_COUNT + 1 ))
  elif [ $RESULT -eq 1 ] || [ $RESULT -eq 139 ]; then
    echo "Failed to use GPU layer count: $GPU_LAYER_COUNT"

    MAX_GPU_LAYER_COUNT=$(( $GPU_LAYER_COUNT - 1 ))
  else
    exit $RESULT
  fi

  GPU_LAYER_COUNT=$(( ($MIN_GPU_LAYER_COUNT + $MAX_GPU_LAYER_COUNT + 1) / 2 ))
done

# Process prompt file.
while [ $BEST_GPU_LAYER_COUNT -ge 0 ]; do
  RESULT=0
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --file "$PROMPT_FILE_PATH" \
    --ctx-size "$MAX_TOKEN_COUNT" \
    --n-gpu-layers "$BEST_GPU_LAYER_COUNT" \
    "${ADDITIONAL_OPTIONS[@]}" || \
    RESULT=$?

  if [ $RESULT -eq 1 ] || [ $RESULT -eq 139 ]; then
    echo "Failed to use GPU layer count: $BEST_GPU_LAYER_COUNT"

    BEST_GPU_LAYER_COUNT=$(( $BEST_GPU_LAYER_COUNT - 1 ))
    echo "Best GPU layer count: $BEST_GPU_LAYER_COUNT"
  else
    exit $RESULT
  fi
done
```

Where `qwen3-coder.sh` is a `llama.cpp` wrapper with some custom system prompt.

It works perfectly, but requires 3-10 seconds to find the appropriate GPU layer count; in my experience that is fine.


@andrew-aladjev commented on GitHub (Sep 8, 2025):

My complete solution looks as follows:

First of all you need to build llama.cpp with rocm support in docker:

`build-llama.cpp.sh`:

```bash
#!/bin/bash
set -e

# Clone llama.cpp.
cd ~/workspace
git clone 'git@github.com:ggml-org/llama.cpp.git' --depth '1' || :
cd llama.cpp
git fetch --all

# Checkout the latest stable tag.
git fetch --tags
LATEST_STABLE_TAG=$(git tag | grep '^b' | sort --human-numeric-sort --reverse | head -n '1')
git checkout "$LATEST_STABLE_TAG"

# Build image for AMD GPU.
docker build \
  --tag 'local/llama.cpp:light-rocm' \
  --target 'light' \
  --file '.devops/rocm.Dockerfile' \
  --progress 'plain' \
  '.'

# Docker cleanup.
docker system prune --force
```

You need to run `build-llama.cpp.sh` to update your rocm `llama.cpp` image.

Then `llama.cpp.sh`:

```bash
#!/bin/bash
set -e

# Prepare nproc.
NPROC=$(nproc)

# Run image.
docker run \
  --rm \
  --network 'host' \
  --group-add 'video' \
  --ipc 'host' \
  --cap-add 'SYS_PTRACE' \
  --security-opt 'seccomp=unconfined' \
  --device '/dev/kfd' \
  --device '/dev/dri' \
  --volume '/home/llama.cpp:/root/.cache/llama.cpp' \
  --volume '/tmp:/tmp' \
  'local/llama.cpp:light-rocm' \
  --threads "$NPROC" \
  --no-escape \
  --reasoning-format 'none' \
  "$@"

Then `qwen3-coder.sh`:

```bash
#!/bin/bash
set -e

# Prepare directory path.
DIR_PATH=$(dirname "${BASH_SOURCE[0]}")

# Read system prompt.
SYSTEM_PROMPT_FILE_PATH="${DIR_PATH}/qwen3-system-prompt.txt"
SYSTEM_PROMPT=$(cat "$SYSTEM_PROMPT_FILE_PATH")

# Run llama.cpp.
llama.cpp.sh \
  --temp '0.7' \
  --top-k '20' \
  --top-p '0.8' \
  --min-p '0.0' \
  --repeat-penalty '1.05' \
  --system-prompt "$SYSTEM_PROMPT" \
  "$@"

`qwen3-system-prompt.txt` is up to you.

`qwen3-coder-30b-a3b.sh`:

```bash
#!/bin/bash
set -e                             
                                                                                                        
# Setup INT handler.                      
trap 'exit 130' INT                                 
                                                                                                        
# Prepare prompt file path.
PROMPT_FILE_PATH="$1"                                                                                   
                                                                                                        
# Prepare HF repository name.                       
HF_REPOSITORY_NAME='unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL'
                                                    
# Prepare max token counts.
MAX_COT_TOKEN_COUNT=$(( 2 ** 12 ))        
MAX_OUTPUT_TOKEN_COUNT=$(( 2 ** 13 ))

# Prepare additional options.
ADDITIONAL_OPTIONS=('--override-tensor' '.ffn_.*_exps.=CPU')

# Find prompt token count.
PROMPT_TOKEN_OUTPUT=$(
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --file "$PROMPT_FILE_PATH" \
    --ctx-size '-2' \
    --no-warmup \
    --no-conversation \
    "${ADDITIONAL_OPTIONS[@]}" 2>&1 || :
)

# Prepare prompt token count.
PROMPT_TOKEN_COUNT=$(
  grep 'prompt is too long' <<< "$PROMPT_TOKEN_OUTPUT" |
    grep -o '[0-9]\+' |
    head -n '1'
)

# Prepare max GPU layer count.
MAX_GPU_LAYER_COUNT=$(
  grep 'n_layer\s*=' <<< "$PROMPT_TOKEN_OUTPUT" |
    grep -o '[0-9]\+' |
    head -n '1'
)

# Prepare max token count.
MAX_TOKEN_COUNT=$(( $PROMPT_TOKEN_COUNT + $MAX_COT_TOKEN_COUNT + $MAX_OUTPUT_TOKEN_COUNT ))
echo "Max token count: ${MAX_TOKEN_COUNT}"

# Find GPU layer counts.
MIN_GPU_LAYER_COUNT=0
GPU_LAYER_COUNT=$MAX_GPU_LAYER_COUNT
BEST_GPU_LAYER_COUNT=$MIN_GPU_LAYER_COUNT

while [ $MIN_GPU_LAYER_COUNT -le $MAX_GPU_LAYER_COUNT ]; do
  RESULT=0
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --ctx-size "$MAX_TOKEN_COUNT" \
    --n-gpu-layers "$GPU_LAYER_COUNT" \
    --no-warmup \
    "${ADDITIONAL_OPTIONS[@]}" > /dev/null 2>&1 || \ 
    RESULT=$?

  if [ $RESULT -eq 0 ]; then
    BEST_GPU_LAYER_COUNT=$GPU_LAYER_COUNT
    echo "Best GPU layer count: $BEST_GPU_LAYER_COUNT"

    MIN_GPU_LAYER_COUNT=$(( $GPU_LAYER_COUNT + 1 ))
  elif [ $RESULT -eq 1 ] || [ $RESULT -eq 139 ]; then
    echo "Failed to use GPU layer count: $GPU_LAYER_COUNT"

    MAX_GPU_LAYER_COUNT=$(( $GPU_LAYER_COUNT - 1 ))
  else
    exit $RESULT
  fi

  GPU_LAYER_COUNT=$(( ($MIN_GPU_LAYER_COUNT + $MAX_GPU_LAYER_COUNT + 1) / 2 ))
done

# Process prompt file.
while [ $BEST_GPU_LAYER_COUNT -ge 0 ]; do
  RESULT=0
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --file "$PROMPT_FILE_PATH" \
    --ctx-size "$MAX_TOKEN_COUNT" \
    --n-gpu-layers "$BEST_GPU_LAYER_COUNT" \
    "${ADDITIONAL_OPTIONS[@]}" || \
    RESULT=$?

  if [ $RESULT -eq 1 ] || [ $RESULT -eq 139 ]; then
    echo "Failed to use GPU layer count: $BEST_GPU_LAYER_COUNT"

    BEST_GPU_LAYER_COUNT=$(( $BEST_GPU_LAYER_COUNT - 1 ))
    echo "Best GPU layer count: $BEST_GPU_LAYER_COUNT"
  else
    exit $RESULT
  fi
done
```

You can run `qwen3-coder-30b-a3b.sh` with one argument: the path of a file containing the prompt.
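
For example, assuming the script is on your PATH and `/tmp/prompt.txt` is a hypothetical prompt file:

```bash
# The binary search for the best GPU layer count runs first,
# then the prompt is processed with that layer count.
qwen3-coder-30b-a3b.sh /tmp/prompt.txt
```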

What I am describing is deliberately an anti-enterprise way of running `qwen3-coder` with `llama.cpp`.

An enterprise setup loads the model once with a constant context size. That is a good approach when your context size is close to its limit on every query; otherwise it is a very bad one.

Instead, I reload the model for each query with a variable context size, so it always runs with the maximum possible number of offloaded GPU layers. As a result I get excellent performance from a single budget GPU, a 7600 XT with 16 GB VRAM, even when the context size is around 64k.

@jessegross commented on GitHub (Sep 8, 2025):

@ShaunaGordon qwen2.5-coder runs on the old engine by default and therefore cannot take advantage of the new memory allocation system. However, it is implemented in the new engine, which you can turn on with `OLLAMA_NEW_ENGINE=1` (plus `OLLAMA_NEW_ESTIMATES=1`).
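
For anyone wanting to try this, a minimal sketch for a manually started server (a systemd install would set these via `systemctl edit ollama` instead; details vary by platform):

```bash
# Enable the preview engine and estimates, then run the server in the foreground.
OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 ollama serve
```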

@jessegross commented on GitHub (Sep 8, 2025):

@andrew-aladjev

> `ollama` is based on `llama.cpp`. From my experience it is completely impossible to predict memory allocation in `llama.cpp` for selected backend. You may predict memory amount for `llama.cpp`, but afterwards you will receive segfault (not enough VRAM) from backend. So I just dropped this idea and switched to binary search algorithm:

This is only true for the old engine, which is based on llama.cpp. Ollama's new engine does not use llama.cpp and has much more accurate allocation, which is currently in preview. It can be enabled with `OLLAMA_NEW_ESTIMATES=1`, as noted above.
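
For a Docker-based setup like the one in this issue, a hedged sketch (the device, volume, and port flags follow the standard ROCm run command and may need adjusting for your host):

```bash
# Pass the preview flags as environment variables to the :rocm container.
docker run -d \
  --device /dev/kfd --device /dev/dri \
  -e OLLAMA_NEW_ENGINE=1 -e OLLAMA_NEW_ESTIMATES=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm
```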

@ShaunaGordon commented on GitHub (Sep 9, 2025):

@jessegross Yes, I understand that. I have gpt-oss using the new system. I was confirming that it is correctly reporting the allocation and seems to be working (though today, I'm getting other crashes with super cryptic error messages, so I don't even know if it's at all related).

I mentioned qwen not because I was expecting it to run on the new system, but because I had originally only been able to replicate the issue with models that split onto the CPU. With qwen, the model fits well within VRAM by all accounts, yet it still throws OOM errors despite ostensibly having 6+ GB of VRAM to spare.

What exactly is the proper course of action at this point? It seems unreasonable to expect people to use only `gpt-oss`, yet the alternatives still run into this because they are on the older system. Are we stuck choosing between `gpt-oss` and undersized models until someone does whatever needs to be done for other models to use the new system?

@jessegross commented on GitHub (Sep 9, 2025):

@ShaunaGordon qwen2 can run on the new engine if you set `OLLAMA_NEW_ENGINE=1`.

@digitalextremist commented on GitHub (Sep 10, 2025):

@andrew-aladjev I have an RX 7800 XT 16GB and your notes got my attention:

Do you mind moving your solution to a repository or group of gists outside this issue?

I am happy with Ollama, especially after @jessegross changed the fabric of reality in our favor, but I feel there is something in what you posted, @andrew-aladjev, when it comes to pinning a certain model. I started this issue wanting a reliable coding agent based on a local LLM, and that would be with the so-called "enterprise solution".


@jessegross is there a compatibility list somewhere showing which models run on which engine?

I am generally using `gpt-oss:20b`, `qwen3-coder:30b`, and `gemma3:4b-qat` ( my wife's [Janet](https://thegoodplace.fandom.com/wiki/Janet)-themed chat/research bot ), but I have also started experimenting with `devstral` ( which seems meh to me ) ... I notice that two I really want to run behave completely differently from the ones in the official registry, already named:

- https://ollama.com/huihui_ai/qwen3-coder-abliterated:30b
- https://ollama.com/huihui_ai/gpt-oss-abliterated:20b

But then this one seems to be fine:

- https://ollama.com/danielsheep/Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth:UD-IQ3_XXS

Wondering if there is a way we can identify which engine a model is using.

@jessegross commented on GitHub (Sep 10, 2025):

@digitalextremist

You can see the architectures that have been implemented in the Ollama engine here:
https://github.com/ollama/ollama/blob/main/model/models/models.go

Not all of these are currently enabled by default. We are working to verify them so that the Ollama engine is used automatically when an implementation exists. Here are the ones that are used automatically:
https://github.com/ollama/ollama/blob/20b53eaa726a4c08043c7af1fa6a322295edcde2/fs/ggml/ggml.go#L241

You can verify whether a particular model is using the Ollama engine by looking for this line in the log file:
`level=INFO source=runner.go:1305 msg="starting ollama engine"`

In some cases, models share the same architecture string but are actually different, so the log message is the best way to know for certain which engine is being used.
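
One way to check, sketched under the assumption of a Linux systemd install (log locations differ on other platforms, and a Docker deployment would use `docker logs` instead):

```bash
# Look for the engine banner emitted when a model loads.
journalctl -u ollama --no-pager | grep 'starting ollama engine'
```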

@digitalextremist commented on GitHub (Sep 11, 2025):

It is awesome of you to share your knowledge on this @jessegross; thanks for being cool and generous with your time. I was definitely not expecting both to have memory estimation solved for most models I use now and to get time-traveled to where I can rtfm in the code directly. I have been seeing */OSS head off a cliff lately with rampant inhumanity. Pardon my being taken aback!

That quality, along with actual ninja skill, is one of, if not the, reason I stick close with Ollama ( while knowing there is more out there ). Anyone can become a rockstar coder; very few can be genuinely cool. Anyway, thanks. I will rtfm from there now. Though I am not a go coder, I am starting to want to superficially understand the code base, since this is awesome. Like loving your chickens or cow. I trip out at the substance I experience from being here. I hope Ollama stays cool long-term!

Thanks again; I am sticking with this one until the others on this issue feel solidly on the other side of the crashes.

@jessegross commented on GitHub (Sep 11, 2025):

@digitalextremist Thank you for your support!

@digitalextremist commented on GitHub (Sep 12, 2025):

And congrats on shipping your new memory estimates as the default in #12252, @jessegross!

@digitalextremist commented on GitHub (Sep 24, 2025):

By the way: what is being called "model scheduling" sounds a lot like it requires, or largely *is*, the memory-estimates functionality? Is that the case @jessegross?

Am updating to 0.12.1 now. It seems like this issue ought to be closed as moot soon, since most of it applies to a prior era.

@rick-github commented on GitHub (Sep 24, 2025):

The model scheduler replaces the memory estimation logic, but only for models running on the ollama engine. Since that actually covers most of the popular models, it is sort of moot.

@digitalextremist commented on GitHub (Sep 24, 2025):

Thanks @rick-github. Since I have not had any issues for many versions now, this issue has gone silent for a while for others, and everything seems to be at a sea change for Ollama overall, I will mark this issue as closed!

Still not totally clear whether @jessegross's memory estimation has already been sunset, but I am about to try the new version.

@jessegross commented on GitHub (Sep 24, 2025):

"Model scheduling" is referring to the new memory estimates, it's the same thing as what we have been discussing here. However, the new system is not really an estimate any more, so it wasn't the best name.

@digitalextremist commented on GitHub (Sep 24, 2025):

Thank you for clarifying @jessegross; I had a feeling that what you have been working on is essentially the same underlying concept. I know we are all navigating abstraction right now, so it is nice to have a feeling be right in such a vague situation.

Again, my congratulations for laying down such a fundamental piece that really got us through to somewhere awesome.
