[GH-ISSUE #9957] [ ROCm error: out of memory ] Runner Terminated: num_ctx within model / hardware limits reliably crashes #68575

Closed
opened 2026-05-04 14:28:31 -05:00 by GiteaMirror · 41 comments

Originally created by @digitalextremist on GitHub (Mar 23, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9957

tl;dr -- Rather than working within the combined VRAM+RAM, a `num_ctx` within hardware and model limits causes a crash at an unknown point on AMD.


This is using the `:rocm` tag of the `ollama` image on `docker`

Host system has `64GB` RAM and GPU has `16GB` VRAM

`ollama ps` with a test case looks like this:

![Image](https://github.com/user-attachments/assets/09e08725-6578-42c2-8aca-ce3e0424d09f)

That is using [`sikamikanikobg/OlympicCoder-7B`](https://ollama.com/sikamikanikobg/OlympicCoder-7B), which is FP16

The model is based on `qwen2` and being run with `32768` for `num_ctx`

I can cause this with many other models, including `gemma3`, but I am focusing on the last instance of the runner crash.

I have included a gist starting well before the crash, and likely shared more than necessary in it, since this is probably a known issue.

I was not able to discern which known issue, though, when searching through open issues.

It does not seem to truly split across both VRAM and system RAM as it pertains to `num_ctx`.

This is using the F/OSS driver; the proprietary driver seems to be meh, so I am avoiding it.
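
For reference, a minimal reproduction sketch of the setup described above; the container flags follow the standard ROCm docker instructions, and the prompt is just a placeholder:

```shell
# Standard ROCm container setup (per ollama's docker docs)
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm

# Request with num_ctx raised to 32768 -- the setting that triggers the crash
curl http://localhost:11434/api/generate -d '{
  "model": "sikamikanikobg/OlympicCoder-7B",
  "prompt": "Write a quicksort in C.",
  "options": { "num_ctx": 32768 }
}'
```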

Relevant log output

Stupid-long version of log: https://gist.github.com/digitalextremist/b59e033c67d28a13f6ab7689131e98e4

Point where it initially hits the fan:

```shell
zero_llm_ollama  | ROCm error: out of memory
zero_llm_ollama  |   current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:366
zero_llm_ollama  |   ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
zero_llm_ollama  | //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:73: ROCm error
zero_llm_ollama  | Memory critical error by agent node-0 (Agent handle: 0x648878515ea0) on address 0x7b8dd8300000. Reason: Memory in use. 
zero_llm_ollama  | SIGABRT: abort
zero_llm_ollama  | PC=0x7b8e42fba00b m=11 sigcode=18446744073709551610
zero_llm_ollama  | signal arrived during cgo execution
```

OS

Docker

GPU

AMD

CPU

Intel

Ollama version

0.6.2

GiteaMirror added the bug label 2026-05-04 14:28:31 -05:00

@rick-github commented on GitHub (Mar 23, 2025):

```shell
zero_llm_ollama  | time=2025-03-23T21:52:07.578Z level=INFO source=server.go:138 msg=offload library=rocm
 layers.requested=-1 layers.model=29 layers.offload=25 layers.split="" memory.available="[15.8 GiB]"
 memory.gpu_overhead="0 B" memory.required.full="18.1 GiB" memory.required.partial="15.6 GiB"
 memory.required.kv="1.8 GiB" memory.required.allocations="[15.6 GiB]" memory.weights.total="12.2 GiB"
 memory.weights.repeating="12.2 GiB" memory.weights.nonrepeating="1.0 GiB" memory.graph.full="1.8 GiB"
 memory.graph.partial="2.3 GiB"
```

ollama estimated it needed 15.6G to offload 25 of 29 layers, with 15.8G available on the device. It looks like it OOM'ed on the first completion call, so the runner tried to allocate something > 200M. ollama estimations are sometimes off; see [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288) for some mitigations. With the change in the ollama runner architecture in the 0.6 series there's likely some work to be done on the estimation logic.
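
The mitigations in the linked comment are along these lines (a sketch; the overhead value is an arbitrary example, and `num_gpu`/`num_ctx` are standard request options):

```shell
# Reserve VRAM headroom the estimator doesn't account for (value in bytes)
OLLAMA_GPU_OVERHEAD=536870912 ollama serve

# Or shrink the context and/or offload fewer layers per request
curl http://localhost:11434/api/generate -d '{
  "model": "sikamikanikobg/OlympicCoder-7B",
  "prompt": "hello",
  "options": { "num_ctx": 16384, "num_gpu": 20 }
}'
```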


@digitalextremist commented on GitHub (Mar 23, 2025):

Thanks a lot @rick-github for the quick diagnosis with mitigation tips; will try those.


@digitalextremist commented on GitHub (Mar 26, 2025):

@jessegross would this also be fixed by the solution in f66216e399 resolving #9890?


@jessegross commented on GitHub (Mar 26, 2025):

> @jessegross would this also be fixed by the solution in f66216e resolving #9890?

Probably not, that commit doesn't have any effect on qwen2.


@digitalextremist commented on GitHub (Mar 26, 2025):

Thanks for the quick follow-up @jessegross!


@rtaic-coder commented on GitHub (Apr 9, 2025):

I have a similar crash with two AMD GPUs (44GB) and 64GB of CPU memory, using Gemma 3 and DeepSeek-R1-32B quantized on `ollama:0.6.5-rocm` running in Docker. `ollama ps` shows the model loaded 100% in GPU, using only 23GB. But it crashed when a completion request with about 81k characters was sent:

```shell
ollama_amd | time=2025-04-09T20:30:25.470Z level=DEBUG source=process_text_spm.go:184 msg="adding bos token to prompt" id=2
ollama_amd | time=2025-04-09T20:30:25.471Z level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=19357 used=0 remaining=19357
ollama_amd | ROCm error: out of memory
ollama_amd | current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:366
ollama_amd | ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
ollama_amd | //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:73: ROCm error
ollama_amd | Memory critical error by agent node-0 (Agent handle: 0x7e3b74665c70) on address 0x7e3b94300000. Reason: Memory in use.
ollama_amd | SIGABRT: abort
ollama_amd | PC=0x7e3c4d55c00b m=27 sigcode=18446744073709551610
ollama_amd | signal arrived during cgo execution
```


@digitalextremist commented on GitHub (Apr 9, 2025):

Thanks for confirming @rtaic-coder

I am still reproducing this with many different models.

It seems memory estimation for model + context + compute is still unreliable.
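
As a rough sanity check, the KV cache portion of that estimate can be reproduced by hand. A sketch below, assuming qwen2-7B-class shape parameters (28 layers, 4 KV heads under GQA, head dim 128); it matches the `memory.required.kv="1.8 GiB"` figure in the offload log earlier in the thread:

```shell
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * num_ctx * bytes/elem
N_LAYERS=28; N_KV_HEADS=4; HEAD_DIM=128; NUM_CTX=32768; BYTES_F16=2
KV_BYTES=$(( 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * NUM_CTX * BYTES_F16 ))
echo "$(( KV_BYTES / 1024 / 1024 )) MiB"   # 1792 MiB, i.e. ~1.8 GiB
```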


@Anaphylaxis commented on GitHub (May 3, 2025):

I experience similar behavior with a 7900 XTX and 80GB RAM. The available GPU RAM goes down and down with each model load, which happens constantly since models get unloaded to make room for the embedding model. It is unstable, and I have to kill Ollama and restart the process to clear the VRAM so I can use 100% GPU again.
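
A possible stopgap instead of killing the process, assuming a recent ollama build with the `stop` subcommand: unload models explicitly so their VRAM is returned.

```shell
# Unload a specific model and free its VRAM without restarting the server
ollama stop qwen3:32b

# Or via the API: keep_alive=0 unloads the model right after the request
curl http://localhost:11434/api/generate -d '{"model": "qwen3:32b", "keep_alive": 0}'
```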


@digitalextremist commented on GitHub (May 3, 2025):

Thanks for confirming that also @Anaphylaxis

@ParthSareen dropped by `discord` today and said he nudged this issue along among the maintainers, so hopefully those of us watching this issue will see a fix come up soon.

I will also be watching for the additional aspect you describe; you make it sound more like a leak than just inconsistent memory estimation. Will look at that aspect too.


@Anaphylaxis commented on GitHub (May 4, 2025):

For example, 3 tok/s vs 25 tok/s if I don't kill the process every so often:

> PS C:\Users\user> ollama ps          
> NAME    ID    SIZE    PROCESSOR    UNTIL 
> PS C:\Users\user> ollama run qwen3:32b    
> >>> /set verbose
> Set 'verbose' mode.
> >>> Hello, how are you?
> <think>
> Let me analyze this simple greeting. Hello is a standard opening that 
> shows politeness and respect. It's a good opportunity to demonstrate 
> warmth and approachability in our conversation. I should respond in a
> friendly and engaging way that invites further discussion.
> 
> I need to consider the user's perspective - they might be testing the
> system, or they might want to have a genuine conversation. Either way, I        
> should remain consistent in my response style, showing empathy and
> understanding.
> 
> The greeting "Hello, how are you?" is simple but effective. It's a 
> standard way to start a conversation that shows consideration for the 
> other person. I should acknowledge their greeting in a similar way, while       
> also showing my own personality.
> 
> I can add a bit of personality to my response by including a friendly 
> emoji to show warmth. This makes the conversation feel more natural and
> personable. I should also make sure to ask them back how they're doing,
> creating a natural flow to the conversation.
> 
> I should be mindful of keeping my response concise but not too brief. The       
> user might be looking for a conversation partner, so I want to be engaging      
> but not overwhelming. I should stay friendly and approachable in my tone.
> 
> The user might be looking for different things in this conversation - 
> maybe just a friendly chat, or perhaps they have questions or need help
> with something. I should be prepared to adapt to whatever direction the
> conversation takes.
> 
> I need to make sure my response is welcoming and open-ended. That way, the      
> user feels comfortable to talk about whatever they want. I should avoid         
> any assumptions about what they're looking for, and instead keep my
> response flexible.
> 
> The user might also be testing how I respond to simple greetings, so I 
> need to make sure my response is appropriate but not overly formal. I want      
> to show that I can have natural, friendly conversations while still being       
> professional and helpful.
> </think>
> 
> Hello! I'm doing well, thank you for asking. I'm always excited to chat         
> and help out. How have you been? I'd love to hear what's on your mind! 😊
> 
> total duration:       2m15.524901s
> load duration:        18.1326ms
> prompt eval count:    14 token(s)
> prompt eval duration: 3.1519391s
> prompt eval rate:     4.44 tokens/s
> eval count:           416 token(s)
> eval duration:        2m12.3537658s
> eval rate:            3.14 tokens/s
> >>> /bye
> PS C:\Users\user> ollama ps           
> NAME         ID              SIZE     PROCESSOR          UNTIL               
> qwen3:32b    e1c9f234c6eb    22 GB    37%/63% CPU/GPU    29 minutes from now    
> PS C:\Users\user> Stop-Process -name "ollama" && ollama` app
> PS C:\Users\user> ollama run qwen3:32b                      
> >>> /set verbose
> Set 'verbose' mode.
> >>> Hello, how are you?
> <think>
> Alright, the user just greeted me and asked how I am. I need to respond in      
> a friendly and welcoming way. Let me make sure I acknowledge their 
> greeting and offer a positive response.
> 
> First, I should thank them for asking. It's a nice touch to show
> appreciation. Then, I can mention that I'm doing well, which keeps the
> conversation upbeat. Including an emoji like a smiley face adds a friendly      
> tone.
> 
> Next, I should invite them to share how they're feeling. This opens the
> door for a more personal conversation and shows I'm interested in their
> well-being. Ending with an offer to help with any questions or topics 
> keeps the interaction helpful and supportive.
> 
> I need to check the flow to ensure it's natural and not too robotic. Also,      
> keeping the response concise but warm is key. Let me put it all together        
> and make sure it sounds genuine.
> </think>
> 
> Hi there! Thank you for asking - I'm doing well, just happy to be here! 😊      
> How are *you* feeling today? I'd love to hear what's on your mind or help       
> with any questions you might have. What would you like to chat about?
> 
> total duration:       9.6188248s
> load duration:        17.9104ms
> prompt eval count:    14 token(s)
> prompt eval duration: 118.3738ms
> prompt eval rate:     118.27 tokens/s
> eval count:           237 token(s)
> eval duration:        9.4708065s
> eval rate:            25.02 tokens/s



@digitalextremist commented on GitHub (May 4, 2025):

Curious to see what happens with 0.6.8 based on the prerelease changelog!

Will check that @Anaphylaxis; did have some issues like that but they were resolved in 0.6.1


@andrew-aladjev commented on GitHub (Jul 22, 2025):

Same issue is still here:

  • 64 GB RAM
  • 16 GB VRAM (7600 XT)
  • rocm 6.4.2.60402-120~24.04
  • ollama 0.9.6
  • Qwen 3 32B (num_ctx 32k)
  • Actual context size is 21k

log:

```shell
Jul 22 05:09:21 puchuu-home-2 ollama[2188]: ROCm error: out of memory
Jul 22 05:09:21 puchuu-home-2 ollama[2188]:   current device: 0, in function alloc at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:372
Jul 22 05:09:21 puchuu-home-2 ollama[2188]:   ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
Jul 22 05:09:21 puchuu-home-2 ollama[2188]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:76: ROCm error
Jul 22 05:09:21 puchuu-home-2 ollama[2188]: Memory critical error by agent node-0 (Agent handle: 0x645e052700b0) on address 0x7636d0200000. Reason: Memory in use.
Jul 22 05:09:21 puchuu-home-2 ollama[2188]: SIGABRT: abort
```

@digitalextremist it looks like nothing changed.


@rick-github commented on GitHub (Jul 22, 2025):

A full server log would help with debugging, but based on ROCm error: out of memory, see https://github.com/ollama/ollama/issues/9957#issuecomment-2746512857
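
For reference, a sketch of how to capture such a log on a systemd install (Docker users can substitute `docker logs ollama`); `OLLAMA_DEBUG=1` is the documented way to raise verbosity:

```shell
# Raise log verbosity, restart, reproduce the crash, then dump the log
sudo systemctl edit ollama    # add: Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama --no-pager > ollama-server.log
```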


@PedroHLC commented on GitHub (Aug 15, 2025):

It shouldn't crash when a 16368 MiB GPU runs a 14GB model, right? The new memory estimates seem to be in place according to the logs:

```shell
time=2025-08-15T10:44:55.096-03:00 level=INFO source=server.go:166 msg="enabling new memory estimates"
time=2025-08-15T10:44:55.097-03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/bin/.ollama-wrapped runner --ollama-engine --model /var/lib/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 40251"
time=2025-08-15T10:44:55.097-03:00 level=INFO source=server.go:657 msg="loading model" "model layers"=25 requested=-1
time=2025-08-15T10:44:55.098-03:00 level=INFO source=server.go:663 msg="system memory" total="62.7 GiB" free="41.1 GiB" free_swap="65.0 GiB"
time=2025-08-15T10:44:55.098-03:00 level=INFO source=server.go:667 msg="gpu memory" id=GPU-11883c28f37e073e available="15.2 GiB" free="15.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-15T10:44:55.103-03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
time=2025-08-15T10:44:55.103-03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:40251"
time=2025-08-15T10:44:55.108-03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:12 GPULayers:25[ID:GPU-11883c28f37e073e Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-15T10:44:55.146-03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32, ID: GPU-11883c28f37e073e
load_backend: loaded ROCm backend from /nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so
load_backend: loaded CPU backend from /nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-cpu-haswell.so
time=2025-08-15T10:44:57.855-03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-15T10:44:58.012-03:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:12 GPULayers:25[ID:GPU-11883c28f37e073e Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:16384 KvCacheType: NumThreads:12 GPULayers:25[ID:GPU-11883c28f37e073e Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=ggml.go:486 msg="offloading 24 repeating layers to GPU"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=ggml.go:497 msg="offloaded 25/25 layers to GPU"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:310 msg="model weights" device=ROCm0 size="11.8 GiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:321 msg="kv cache" device=ROCm0 size="492.0 MiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:332 msg="compute graph" device=ROCm0 size="2.1 GiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=backend.go:342 msg="total memory" size="15.4 GiB"
time=2025-08-15T10:44:58.058-03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-08-15T10:44:58.058-03:00 level=INFO source=server.go:1232 msg="waiting for llama runner to start responding"
time=2025-08-15T10:44:58.059-03:00 level=INFO source=server.go:1266 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-15T10:45:01.080-03:00 level=INFO source=server.go:1270 msg="llama runner started in 5.98 seconds"
ROCm error: out of memory
  current device: 0, in function alloc at /build/source/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:424
  ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
/build/source/ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: ROCm error
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-base.so(+0x2fc78) [0x7f3d302ebc78]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-base.so(ggml_print_backtrace+0x231) [0x7f3d302ebc51]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-base.so(ggml_abort+0x109) [0x7f3d302ebda9]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x106b52) [0x7f3cf8a97b52]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x116432) [0x7f3cf8aa7432]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x10f406) [0x7f3cf8aa0406]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x10eb42) [0x7f3cf8a9fb42]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/lib/ollama/libggml-hip.so(+0x10cd7c) [0x7f3cf8a9dd7c]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/bin/.ollama-wrapped() [0x1c5f16b]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/bin/.ollama-wrapped() [0x1c0c96b]
/nix/store/yy3x8grizgyy522kd0lk4f4nhv5pybay-ollama-0.11.4/bin/.ollama-wrapped() [0xf3d464]
SIGABRT: abort
PC=0x7f3d79a6af3c m=16 sigcode=18446744073709551610
signal arrived during cgo execution
goroutine 51 gp=0xc000582e00 m=16 mp=0xc000101808
```

Used v0.11.5-rc2 and `gpt-oss:20b`. GPU is an RX 6800 with ROCm 6.3.3.

EDIT: I can do small talk with the model 100% on GPU; it breaks with anything more complex.

EDIT2: Ah okay, it works if I limit myself to ~7104 tokens. (But I would love to mix usage with RAM and increase this.)
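
For anyone wanting to pin that ceiling down explicitly rather than rediscover it per prompt, a sketch (the 7104 figure is just the empirical limit from the comment above):

```shell
# Server-wide default context length (env var)
OLLAMA_CONTEXT_LENGTH=7104 ollama serve

# Or per session in the CLI
ollama run gpt-oss:20b
>>> /set parameter num_ctx 7104
```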


@jessegross commented on GitHub (Aug 15, 2025):

@PedroHLC Can you post the full log?


@PedroHLC commented on GitHub (Aug 17, 2025):

> @PedroHLC Can you post the full log?

[ollama-rocm-oom.log](https://github.com/user-attachments/files/21823742/ollama-rocm-oom.log)


@PedroHLC commented on GitHub (Aug 29, 2025):

@jessegross I'm unable to reproduce this error any more with v0.11.8-rc0.


@jessegross commented on GitHub (Aug 29, 2025):

@PedroHLC Flash attention was turned on by default for gpt-oss in 0.11.8, which reduces the size of those allocations.


@digitalextremist commented on GitHub (Aug 29, 2025):

For the record, I am no longer experiencing this issue since 0.11.6 or so, possibly earlier.

Right now gpt-oss:20b and qwen3-coder:30b have both performed at the maximum possible context, reliably, for 18-hour stretches in code sessions for days on end. I would call this one resolved on my side @jessegross.

That memory estimation system you dropped on us changed the game it feels like!

Will keep open until @PedroHLC verifies he is good to go also.


@ShaunaGordon commented on GitHub (Sep 3, 2025):

I just updated to 0.11.8 and am still getting this error. ☹️


@digitalextremist commented on GitHub (Sep 4, 2025):

Sorry to hear that! Curious @ShaunaGordon:

  • Which GPU are you using?
  • Are you using a docker image?
  • What model do you see this most with?
  • What is your num_ctx setting on the request?
  • Are you using flash attention and/or kv cache? (see the sketch after this list)
  • Are you using the new memory estimation approach?
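
A sketch of the flash attention / KV cache settings referenced above: both are opt-in environment variables in current builds, and KV cache quantization only takes effect when flash attention is enabled.

```shell
# Enable flash attention and quantize the KV cache to q8_0
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
```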

@ShaunaGordon commented on GitHub (Sep 4, 2025):

It seems to be intermittent now, after I happened to notice my autocomplete (cogito:3b) started working again.

  • Radeon 7900XTX 24gb vram (Ryzen 9 9950X3D, 64gb ram)
  • No docker images, ollama-rocm straight from Arch repos on the host system
  • I was getting it with my cogito:8b, initially, but now my smaller models seem to be okay, so I tried with larger models, command-r and qwen3 30b. Oddly, the larger models clock in at 30gb and 26gb respectively when started from Continue, despite only being 20gb according to their respective pages. Both models trigger the OOM error from ollama, but only when Continue calls them. Nothing else reports using the remaining system resources. System ram never seems to break 35gb usage, and the CPU barely has to try. Continue is obviously doing something to push the models' usage up (something I'll talk to them about as +50% seems excessive to me), but even so, there's no apparent reason for ollama to go OOM. When I try models that split with a similar percentage in the ollama CLI or in the Newelle client (such as Mixtral 8x7b, as both command-r and qwen fit when run from these clients), I don't get OOM errors, though Mixtral only shows 4096 for context length.
  • I've left num_ctx as default for both ollama and Continue (the client I'm working through). ollama ps is reporting 32768 for the context size for all tests from Continue (which is way low for command-r and cogito (128k), and qwen3:30b (256k) and way high for command-r7b (8k), though ollama appears to properly detect the lengths set by the models). I want to say Continue uses that as its default to allow space for most models, since it sets a max somewhere and lets the user set up to that max, but in this case, it's coming in way below the big models, so I wouldn't expect ollama to really care.
  • I'm not setting those and it doesn't look like Continue is, either. So it's whatever ollama defaults to.
  • Is there a particular flag for it? I'm willing to try if I need to explicitly activate it. I just assumed 0.11.8 switched to it by default when it was released.

@digitalextremist commented on GitHub (Sep 4, 2025):

Thanks for the details @ShaunaGordon

Increased size sounds like context length, probably. And those context numbers sound high, even for 24GB. That is when I started to feel the issue before: the context length would technically fit, but because the estimate of what was needed in reality was off, it would crash.

Are you using the new memory management system #11090? And flash attention, and/or KV cache?

I am using `Zed` myself. That has a tightly configured `num_ctx` which is clear to see, so I'm not sure about other IDEs. But the numbers you are seeing in `ps` seem high. Curious how `gpt-oss:20b` looks for you, since so much emphasis went into optimizing for that, as with `gemma3` ... you might also try the `qat` tags on those models just to see how it goes.


@jessegross commented on GitHub (Sep 4, 2025):

@ShaunaGordon The new memory management isn't on by default yet - you need to set the environment variable OLLAMA_NEW_ESTIMATES=1. However, it's also only supported in the new engine and most of the models you listed are only available in the old engine. gpt-oss would be new engine.
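
A sketch of turning that on, for either a bare-metal or Docker install (container flags as in the standard ROCm instructions):

```shell
# Bare metal
OLLAMA_NEW_ESTIMATES=1 ollama serve

# Docker (ROCm image)
docker run -d -e OLLAMA_NEW_ESTIMATES=1 \
  --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm
```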


@ShaunaGordon commented on GitHub (Sep 8, 2025):

I just got an OOM error with qwen2.5-coder:14b for some reason. It's coming in at 18gb even through Continue, so even with the bloating, it fits well within the GPU (and ollama allocates accordingly). There's no evident reason it should be erroring.

gpt-oss is reporting 14gb, so that seems to be working. It also so far isn't triggering OOM errors, so that's good.


@andrew-aladjev commented on GitHub (Sep 8, 2025):

`ollama` is based on `llama.cpp`. In my experience it is completely impossible to predict memory allocation in `llama.cpp` for a selected backend. You may predict the memory amount for `llama.cpp`, but afterwards you will receive a segfault (not enough VRAM) from the backend. So I just dropped this idea and switched to a binary search algorithm:

```bash
#!/bin/bash
set -e

# Setup INT handler.
trap 'exit 130' INT

...

# Prepare HF repository name.
HF_REPOSITORY_NAME='unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL'

# Prepare token counts.
PROMPT_TOKEN_COUNT=<TODO>
MAX_COT_TOKEN_COUNT=$(( 2 ** 12 ))
MAX_OUTPUT_TOKEN_COUNT=$(( 2 ** 13 ))

# Prepare GPU layer counts.
MIN_GPU_LAYER_COUNT=0
MAX_GPU_LAYER_COUNT=49

# Prepare additional options.
ADDITIONAL_OPTIONS=('--override-tensor' '.ffn_.*_exps.=CPU')

# Prepare max token count.
MAX_TOKEN_COUNT=$(( $PROMPT_TOKEN_COUNT + $MAX_COT_TOKEN_COUNT + $MAX_OUTPUT_TOKEN_COUNT ))
echo "Max token count: ${MAX_TOKEN_COUNT}"

# Find GPU layer counts.
GPU_LAYER_COUNT=$MAX_GPU_LAYER_COUNT
BEST_GPU_LAYER_COUNT=$MIN_GPU_LAYER_COUNT

while [ $MIN_GPU_LAYER_COUNT -le $MAX_GPU_LAYER_COUNT ]; do
  RESULT=0
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --ctx-size "$MAX_TOKEN_COUNT" \
    --n-gpu-layers "$GPU_LAYER_COUNT" \
    --no-warmup \
    "${ADDITIONAL_OPTIONS[@]}" > /dev/null 2>&1 || \
    RESULT=$?

  if [ $RESULT -eq 0 ]; then
    BEST_GPU_LAYER_COUNT=$GPU_LAYER_COUNT
    echo "Best GPU layer count: $BEST_GPU_LAYER_COUNT"

    MIN_GPU_LAYER_COUNT=$(( $GPU_LAYER_COUNT + 1 ))
  elif [ $RESULT -eq 1 ] || [ $RESULT -eq 139 ]; then
    echo "Failed to use GPU layer count: $GPU_LAYER_COUNT"

    MAX_GPU_LAYER_COUNT=$(( $GPU_LAYER_COUNT - 1 ))
  else
    exit $RESULT
  fi

  GPU_LAYER_COUNT=$(( ($MIN_GPU_LAYER_COUNT + $MAX_GPU_LAYER_COUNT + 1) / 2 ))
done

# Process prompt file.
while [ $BEST_GPU_LAYER_COUNT -ge 0 ]; do
  RESULT=0
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --file "$PROMPT_FILE_PATH" \
    --ctx-size "$MAX_TOKEN_COUNT" \
    --n-gpu-layers "$BEST_GPU_LAYER_COUNT" \
    "${ADDITIONAL_OPTIONS[@]}" || \
    RESULT=$?

  if [ $RESULT -eq 1 ] || [ $RESULT -eq 139 ]; then
    echo "Failed to use GPU layer count: $BEST_GPU_LAYER_COUNT"

    BEST_GPU_LAYER_COUNT=$(( $BEST_GPU_LAYER_COUNT - 1 ))
    echo "Best GPU layer count: $BEST_GPU_LAYER_COUNT"
  else
    exit $RESULT
  fi
done
```

Where `qwen3-coder.sh` is a `llama.cpp` wrapper with some custom system prompt.

It works perfectly, but requires 3-10 seconds to find the appropriate GPU layer count; in my experience that is fine.


@andrew-aladjev commented on GitHub (Sep 8, 2025):

My complete solution looks as follows:

First of all you need to build llama.cpp with rocm support in docker:

`build-llama.cpp.sh`:

```bash
#!/bin/bash
set -e

# Clone llama.cpp.
cd ~/workspace
git clone 'git@github.com:ggml-org/llama.cpp.git' --depth '1' || :
cd llama.cpp
git fetch --all

# Checkout the latest stable tag.
git fetch --tags
LATEST_STABLE_TAG=$(git tag | grep '^b' | sort --human-numeric-sort --reverse | head -n '1')
git checkout "$LATEST_STABLE_TAG"

# Build image for AMD GPU.
docker build \
  --tag 'local/llama.cpp:light-rocm' \
  --target 'light' \
  --file '.devops/rocm.Dockerfile' \
  --progress 'plain' \
  '.'

# Docker cleanup.
docker system prune --force
```

You need to run `build-llama.cpp.sh` to update your rocm `llama.cpp` image.

Then `llama.cpp.sh`:

```bash
#!/bin/bash
set -e

# Prepare nproc.
NPROC=$(nproc)

# Run image.
docker run \
  --rm \
  --network 'host' \
  --group-add 'video' \
  --ipc 'host' \
  --cap-add 'SYS_PTRACE' \
  --security-opt 'seccomp=unconfined' \
  --device '/dev/kfd' \
  --device '/dev/dri' \
  --volume '/home/llama.cpp:/root/.cache/llama.cpp' \
  --volume '/tmp:/tmp' \
  'local/llama.cpp:light-rocm' \
  --threads "$NPROC" \
  --no-escape \
  --reasoning-format 'none' \
  "$@"

Then `qwen3-coder.sh`:

```bash
#!/bin/bash
set -e

# Prepare directory path.
DIR_PATH=$(dirname "${BASH_SOURCE[0]}")

# Read system prompt.
SYSTEM_PROMPT_FILE_PATH="${DIR_PATH}/qwen3-system-prompt.txt"
SYSTEM_PROMPT=$(cat "$SYSTEM_PROMPT_FILE_PATH")

# Run llama.cpp.
llama.cpp.sh \
  --temp '0.7' \
  --top-k '20' \
  --top-p '0.8' \
  --min-p '0.0' \
  --repeat-penalty '1.05' \
  --system-prompt "$SYSTEM_PROMPT" \
  "$@"

`qwen3-system-prompt.txt` is up to you.

`qwen3-coder-30b-a3b.sh`:

```bash
#!/bin/bash
set -e                             
                                                                                                        
# Setup INT handler.                      
trap 'exit 130' INT                                 
                                                                                                        
# Prepare prompt file path.
PROMPT_FILE_PATH="$1"                                                                                   
                                                                                                        
# Prepare HF repository name.                       
HF_REPOSITORY_NAME='unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_XL'
                                                    
# Prepare max token counts.
MAX_COT_TOKEN_COUNT=$(( 2 ** 12 ))        
MAX_OUTPUT_TOKEN_COUNT=$(( 2 ** 13 ))

# Prepare additional options.
ADDITIONAL_OPTIONS=('--override-tensor' '.ffn_.*_exps.=CPU')

# Find prompt token count.
PROMPT_TOKEN_OUTPUT=$(
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --file "$PROMPT_FILE_PATH" \
    --ctx-size '-2' \
    --no-warmup \
    --no-conversation \
    "${ADDITIONAL_OPTIONS[@]}" 2>&1 || :
)

# Prepare prompt token count.
PROMPT_TOKEN_COUNT=$(
  grep 'prompt is too long' <<< "$PROMPT_TOKEN_OUTPUT" |
    grep -o '[0-9]\+' |
    head -n '1'
)

# Prepare max GPU layer count.
MAX_GPU_LAYER_COUNT=$(
  grep 'n_layer\s*=' <<< "$PROMPT_TOKEN_OUTPUT" |
    grep -o '[0-9]\+' |
    head -n '1'
)

# Prepare max token count.
MAX_TOKEN_COUNT=$(( $PROMPT_TOKEN_COUNT + $MAX_COT_TOKEN_COUNT + $MAX_OUTPUT_TOKEN_COUNT ))
echo "Max token count: ${MAX_TOKEN_COUNT}"

# Find GPU layer counts.
MIN_GPU_LAYER_COUNT=0
GPU_LAYER_COUNT=$MAX_GPU_LAYER_COUNT
BEST_GPU_LAYER_COUNT=$MIN_GPU_LAYER_COUNT

while [ $MIN_GPU_LAYER_COUNT -le $MAX_GPU_LAYER_COUNT ]; do
  RESULT=0
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --ctx-size "$MAX_TOKEN_COUNT" \
    --n-gpu-layers "$GPU_LAYER_COUNT" \
    --no-warmup \
    "${ADDITIONAL_OPTIONS[@]}" > /dev/null 2>&1 || \ 
    RESULT=$?

  if [ $RESULT -eq 0 ]; then
    BEST_GPU_LAYER_COUNT=$GPU_LAYER_COUNT
    echo "Best GPU layer count: $BEST_GPU_LAYER_COUNT"

    MIN_GPU_LAYER_COUNT=$(( $GPU_LAYER_COUNT + 1 ))
  elif [ $RESULT -eq 1 ] || [ $RESULT -eq 139 ]; then
    echo "Failed to use GPU layer count: $GPU_LAYER_COUNT"

    MAX_GPU_LAYER_COUNT=$(( $GPU_LAYER_COUNT - 1 ))
  else
    exit $RESULT
  fi

  GPU_LAYER_COUNT=$(( ($MIN_GPU_LAYER_COUNT + $MAX_GPU_LAYER_COUNT + 1) / 2 ))
done

# Process prompt file.
while [ $BEST_GPU_LAYER_COUNT -ge 0 ]; do
  RESULT=0
  qwen3-coder.sh \
    --hf-repo "$HF_REPOSITORY_NAME" \
    --file "$PROMPT_FILE_PATH" \
    --ctx-size "$MAX_TOKEN_COUNT" \
    --n-gpu-layers "$BEST_GPU_LAYER_COUNT" \
    "${ADDITIONAL_OPTIONS[@]}" || \
    RESULT=$?

  if [ $RESULT -eq 1 ] || [ $RESULT -eq 139 ]; then
    echo "Failed to use GPU layer count: $BEST_GPU_LAYER_COUNT"

    BEST_GPU_LAYER_COUNT=$(( $BEST_GPU_LAYER_COUNT - 1 ))
    echo "Best GPU layer count: $BEST_GPU_LAYER_COUNT"
  else
    exit $RESULT
  fi
done
```

You can run `qwen3-coder-30b-a3b.sh` with one argument: the path of a file containing the prompt.
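
For example, assuming the script is on your PATH and `/tmp/prompt.txt` is a hypothetical prompt file:

```bash
# The binary search for the best GPU layer count runs first,
# then the prompt is processed with that layer count.
qwen3-coder-30b-a3b.sh /tmp/prompt.txt
```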

What I am describing is deliberately an anti-enterprise way of running `qwen3-coder` with `llama.cpp`.

An enterprise setup loads the model once with a constant context size. That is a good approach when your context size is close to its limit on every query; otherwise it is a very bad one.

Instead, I reload the model for each query with a variable context size, so it always runs with the maximum possible number of offloaded GPU layers. As a result I get excellent performance from a single budget GPU, a 7600 XT with 16 GB VRAM, even when the context size is around 64k.

@jessegross commented on GitHub (Sep 8, 2025):

@ShaunaGordon qwen2.5-coder runs on the old engine by default and therefore cannot take advantage of the new memory allocation system. However, it is implemented in the new engine, which you can turn on with `OLLAMA_NEW_ENGINE=1` (plus `OLLAMA_NEW_ESTIMATES=1`).
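
For anyone wanting to try this, a minimal sketch for a manually started server (a systemd install would set these via `systemctl edit ollama` instead; details vary by platform):

```bash
# Enable the preview engine and estimates, then run the server in the foreground.
OLLAMA_NEW_ENGINE=1 OLLAMA_NEW_ESTIMATES=1 ollama serve
```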

@jessegross commented on GitHub (Sep 8, 2025):

@andrew-aladjev

> `ollama` is based on `llama.cpp`. From my experience it is completely impossible to predict memory allocation in `llama.cpp` for selected backend. You may predict memory amount for `llama.cpp`, but afterwards you will receive segfault (not enough VRAM) from backend. So I just dropped this idea and switched to binary search algorithm:

This is only true for the old engine, which is based on llama.cpp. Ollama's new engine does not use llama.cpp and has much more accurate allocation, which is currently in preview. It can be enabled with `OLLAMA_NEW_ESTIMATES=1`, as noted above.
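
For a Docker-based setup like the one in this issue, a hedged sketch (the device, volume, and port flags follow the standard ROCm run command and may need adjusting for your host):

```bash
# Pass the preview flags as environment variables to the :rocm container.
docker run -d \
  --device /dev/kfd --device /dev/dri \
  -e OLLAMA_NEW_ENGINE=1 -e OLLAMA_NEW_ESTIMATES=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm
```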

@ShaunaGordon commented on GitHub (Sep 9, 2025):

@jessegross Yes, I understand that. I have gpt-oss using the new system. I was confirming that it is correctly reporting the allocation and seems to be working (though today, I'm getting other crashes with super cryptic error messages, so I don't even know if it's at all related).

I mentioned qwen not because I was expecting it to run on the new system, but because I had originally only been able to replicate the issue with models that split onto the CPU. With qwen, the model fits well within VRAM by all accounts, yet it still throws OOM errors despite ostensibly having 6+ GB of VRAM to spare.

What exactly is the proper course of action at this point? It seems unreasonable to expect people to use only `gpt-oss`, yet the alternatives still run into this because they are on the older system. Are we stuck choosing between `gpt-oss` and undersized models until someone does whatever needs to be done for other models to use the new system?

@jessegross commented on GitHub (Sep 9, 2025):

@ShaunaGordon qwen2 can run on the new engine if you set `OLLAMA_NEW_ENGINE=1`.

@digitalextremist commented on GitHub (Sep 10, 2025):

@andrew-aladjev I have an RX 7800 XT 16GB and your notes got my attention:

Do you mind moving your solution to a repository or group of gists outside this issue?

I am happy with Ollama, especially after @jessegross changed the fabric of reality in our favor, but I feel there is something in what you posted, @andrew-aladjev, when it comes to pinning a certain model. I started this issue wanting a reliable coding agent based on a local LLM, and that would be with the so-called "enterprise solution".


@jessegross is there a compatibility list somewhere showing which models run on which engine?

I am generally using `gpt-oss:20b`, `qwen3-coder:30b`, and `gemma3:4b-qat` ( my wife's [Janet](https://thegoodplace.fandom.com/wiki/Janet)-themed chat/research bot ), but I have also started experimenting with `devstral` ( which seems meh to me ) ... I notice that two I really want to run behave completely differently from the ones in the official registry, already named:

- https://ollama.com/huihui_ai/qwen3-coder-abliterated:30b
- https://ollama.com/huihui_ai/gpt-oss-abliterated:20b

But then this one seems to be fine:

- https://ollama.com/danielsheep/Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth:UD-IQ3_XXS

Wondering if there is a way we can identify which engine a model is using.

@jessegross commented on GitHub (Sep 10, 2025):

@digitalextremist

You can see the architectures that have been implemented in the Ollama engine here:
https://github.com/ollama/ollama/blob/main/model/models/models.go

Not all of these are currently enabled by default. We are working to verify them so that the Ollama engine is used automatically when an implementation exists. Here are the ones that are used automatically:
https://github.com/ollama/ollama/blob/20b53eaa726a4c08043c7af1fa6a322295edcde2/fs/ggml/ggml.go#L241

You can verify whether a particular model is using the Ollama engine by looking for this line in the log file:
`level=INFO source=runner.go:1305 msg="starting ollama engine"`

In some cases, models share the same architecture string but are actually different, so the log message is the best way to know for certain which engine is being used.
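
One way to check, sketched under the assumption of a Linux systemd install (log locations differ on other platforms, and a Docker deployment would use `docker logs` instead):

```bash
# Look for the engine banner emitted when a model loads.
journalctl -u ollama --no-pager | grep 'starting ollama engine'
```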

@digitalextremist commented on GitHub (Sep 11, 2025):

It is awesome of you to share your knowledge on this @jessegross; thanks for being cool and generous with your time. I was definitely not expecting both to have memory estimation solved for most models I use now and to get time-traveled to where I can rtfm in the code directly. I have been seeing */OSS head off a cliff lately with rampant inhumanity. Pardon my being taken aback!

That quality, along with actual ninja skill, is one of, if not the, reason I stick close with Ollama ( while knowing there is more out there ). Anyone can become a rockstar coder; very few can be genuinely cool. Anyway, thanks. I will rtfm from there now. Though I am not a go coder, I am starting to want to superficially understand the code base, since this is awesome. Like loving your chickens or cow. I trip out at the substance I experience from being here. I hope Ollama stays cool long-term!

Thanks again; I am sticking with this one until the others on this issue feel solidly on the other side of the crashes.

@jessegross commented on GitHub (Sep 11, 2025):

@digitalextremist Thank you for your support!

@digitalextremist commented on GitHub (Sep 12, 2025):

And congrats on shipping your new memory estimates as the default in #12252, @jessegross!

@digitalextremist commented on GitHub (Sep 24, 2025):

By the way: what is being called "model scheduling" sounds a lot like it requires, or largely *is*, the memory-estimates functionality? Is that the case @jessegross?

Am updating to 0.12.1 now. It seems like this issue ought to be closed as moot soon, since most of it applies to a prior era.

@rick-github commented on GitHub (Sep 24, 2025):

The model scheduler replaces the memory estimation logic, but only for models running on the ollama engine. Since that actually covers most of the popular models, it is sort of moot.

@digitalextremist commented on GitHub (Sep 24, 2025):

Thanks @rick-github. Since I have not had any issues for many versions now, this issue has gone silent for a while for others, and everything seems to be at a sea change for Ollama overall, I will mark this issue as closed!

Still not totally clear whether @jessegross's memory estimation has already been sunset, but I am about to try the new version.

@jessegross commented on GitHub (Sep 24, 2025):

"Model scheduling" is referring to the new memory estimates, it's the same thing as what we have been discussing here. However, the new system is not really an estimate any more, so it wasn't the best name.

@digitalextremist commented on GitHub (Sep 24, 2025):

Thank you for clarifying @jessegross; I had a feeling that what you have been working on is essentially the same underlying concept. I know we are all navigating abstraction right now, so it is nice to have a feeling be right in such a vague situation.

Again, my congratulations for laying down such a fundamental piece that really got us through to somewhere awesome.
