[GH-ISSUE #12756] Server crashes when trying to run some models #8460

Closed
opened 2026-04-12 21:08:56 -05:00 by GiteaMirror · 6 comments

Originally created by @CryptoCopter on GitHub (Oct 23, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12756

What is the issue?

Sometimes the server just straight up panics/crashes.
To be clear, this is not an intermittent issue - the models that work always work, and those that crash do so every time.

Unfortunately, the logfile is too large to be put into the issue directly, so [here's a link](https://fserve.splork.de/public/server.log).

The log is from trying to run `mistral-nemo:12b`, which I chose because I was previously able to run it on this very machine with earlier Ollama versions.
Unfortunately, I can't tell exactly when the regression was introduced.

Relevant log output


OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.12.6

GiteaMirror added the bug label 2026-04-12 21:08:57 -05:00

@rick-github commented on GitHub (Oct 23, 2025):

```
time=2025-10-23T15:23:39.626+02:00 level=INFO source=server.go:545 msg=offload library=CUDA layers.requested=-1
 layers.model=41 layers.offload=1 layers.split=[1] memory.available="[11.0 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="36.2 GiB" memory.required.partial="10.6 GiB" memory.required.kv="20.0 GiB"
 memory.required.allocations="[10.6 GiB]" memory.weights.total="6.2 GiB" memory.weights.repeating="5.7 GiB"
 memory.weights.nonrepeating="525.0 MiB" memory.graph.full="8.3 GiB" memory.graph.partial="8.9 GiB"
```

A context size of 128k produces a large compute graph, so only one layer is assigned to the GPU, taking 10.6 of the 11 GiB available.
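
As a rough sanity check on the `memory.required.kv="20.0 GiB"` figure above (assuming mistral-nemo's commonly published shape of 40 transformer layers, 8 KV heads, and 128-dim heads, with an f16 cache), the KV cache at a 128k context works out to exactly that:

```shell
# KV bytes = 2 (K and V) * layers * kv_heads * head_dim * context_len * bytes_per_element
# 2 * 40 * 8 * 128 * 131072 * 2 = 21474836480 bytes = 20 GiB
echo $((2 * 40 * 8 * 128 * 131072 * 2))
```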

```
graph_reserve: failed to allocate compute buffers
Exception 0xc0000005 0x0 0x1f5dd282618 0x7ff9dd9dfe5a
PC=0x7ff9dd9dfe5a
signal arrived during external code execution
```

During loading the runner couldn't allocate enough memory: the initial allocation left only about 400 MiB of wiggle room, seemingly not enough margin. It's possible that in previous versions of ollama, `mistral-nemo:12b` was running entirely on the CPU, so this wasn't an issue earlier.

See [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288) for ways to mitigate this.
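
One of the simplest mitigations from that thread is lowering the context size so the graph and KV cache shrink. A minimal sketch (using the standard `num_ctx` option; 8192 is just an illustrative value):

```shell
# Interactively: load the model, then shrink the context window in the REPL:
ollama run mistral-nemo:12b
#   /set parameter num_ctx 8192

# Or per request via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral-nemo:12b",
  "prompt": "Hello",
  "options": { "num_ctx": 8192 }
}'
```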


@CryptoCopter commented on GitHub (Oct 23, 2025):

Okay... that is strange, since inference was always super snappy.
But I will take a look at the mitigations, thanks!

But maybe giving some sort of warning `Not enough VRAM, Need X, Have Y` would be better than just crashing...


@rick-github commented on GitHub (Oct 23, 2025):

Yeah, well it's not a feature, it's a bug. That's why there are new releases from time to time.


@CryptoCopter commented on GitHub (Oct 23, 2025):

Of course, I was not trying to imply that it's expected behaviour. I'm sorry if I phrased it that way.

Since I'm filing a bug report, I just thought I would give a hint towards what I would see as improved behaviour.
Since ollama seems to know beforehand how much memory it's going to need, it might be sensible to abort with a warning if the available VRAM is insufficient.


@rick-github commented on GitHub (Oct 23, 2025):

ollama does its best to estimate usage, but it's not always accurate - model architectures differ wildly in how they use memory, and it's hard to capture all the nuances during the estimation stage. The last few releases have improved the memory estimation logic, but unfortunately only for those models that run on the new ollama engine. As model families are migrated to the ollama engine these sorts of memory miscalculations will diminish, but for now the easiest way to reduce the OOMs is to follow some of the mitigation strategies in the post.
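
For what it's worth, two of those mitigation strategies can be applied at the server level via environment variables (a sketch using the documented `OLLAMA_FLASH_ATTENTION` and `OLLAMA_KV_CACHE_TYPE` settings; `q8_0` cache quantization requires flash attention to be enabled):

```shell
# Enable flash attention and quantize the KV cache to 8-bit,
# roughly halving the f16 KV cache footprint at a given context size
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```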


@CryptoCopter commented on GitHub (Oct 23, 2025):

Yes, enabling flash attention and setting KV quantization to 8-bit did the trick, thanks!
