[GH-ISSUE #10532] LLAMA4 Scout crashes during run #68988

Closed
opened 2026-05-04 16:40:10 -05:00 by GiteaMirror · 10 comments

Originally created by @srshkmr on GitHub (May 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10532

### What is the issue?

```
ollama run llama4:17b-scout-16e-instruct-q4_K_M
Error: llama runner process has terminated: signal: killed
```

I am on an M3 Pro with 36 GB of RAM.

Do I have to wait for a different model?

### Relevant log output

```shell
```

### OS

macOS

### GPU

Apple

### CPU

Apple

### Ollama version

0.6.7

GiteaMirror added the bug label 2026-05-04 16:40:10 -05:00

@rick-github commented on GitHub (May 2, 2025):

`signal: killed` generally means that an external actor killed the runner. The likely cause is that the operating system thought that the runner was too big and would cause an OOM condition on the machine. There should be log entries from your OS in the system log.
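
For anyone looking for those entries, the snippet below sketches one way to query the macOS unified log. The exact message text varies by macOS version, so the search terms are only a starting point.

```shell
# Search the unified log for recent kill events (message text varies):
log show --last 30m --info --predicate 'eventMessage CONTAINS[c] "kill"'

# Memory-pressure kills usually come from the kernel's memorystatus
# (jetsam) mechanism, so this is also worth a look:
log show --last 30m --predicate 'eventMessage CONTAINS[c] "memorystatus"'
```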


@igorschlum commented on GitHub (May 12, 2025):

Hi, I'm adding my experience to this bug report as I'm encountering the same issue when trying to run llama4:maverick with Ollama 0.6.8.

My Environment:

- Ollama Version: 0.6.8
- Mac Station with 192 GB of RAM, macOS Sequoia 15.4.1

I thought that llama4 Maverick used only some of its layers at a time and so could run with 180 GB of RAM, despite the model size being 245 GB.


@rick-github commented on GitHub (May 12, 2025):

It should be able to run with 180 GB of RAM; there just has to be enough swap to accommodate the part of the model that is not RAM-resident. I'm not a Mac user, so I don't know if this is helpful: https://discussions.apple.com/thread/252429784
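
As a quick sanity check, macOS reports its current dynamic swap usage through the built-in `sysctl`:

```shell
# Report macOS dynamic swap usage:
sysctl vm.swapusage
# Example output:
# vm.swapusage: total = 2048.00M  used = 1027.25M  free = 1020.75M  (encrypted)
```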


@igorschlum commented on GitHub (May 18, 2025):

Hi @rick-github,

Thanks for your suggestion. I spent some time looking at how swap works on macOS.

Swap is normally managed automatically on macOS, and even after I freed up approximately 900 GB of space on my internal HD, I'm still encountering a kill message before the loading process completes.

Here is my log:

```shell
time=2025-05-18T03:31:57.503+02:00 level=INFO source=server.go:135 msg="system memory" total="192.0 GiB" free="221.5 GiB" free_swap="0 B"
time=2025-05-18T03:31:57.504+02:00 level=INFO source=server.go:168 msg=offload library=metal layers.requested=-1 layers.model=49 layers.offload=27 layers.split="" memory.available="[144.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="238.9 GiB" memory.required.partial="143.9 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[143.9 GiB]" memory.weights.total="225.8 GiB" memory.weights.repeating="225.1 GiB" memory.weights.nonrepeating="809.3 MiB" memory.graph.full="404.6 MiB" memory.graph.partial="404.6 MiB" projector.weights="1.6 GiB" projector.graph="0 B"
time=2025-05-18T03:31:57.541+02:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/igor/.ollama/models/blobs/sha256-ecdedd393ed15c5cd32bb4ae6240db958f600d757daa64aab531656964b13b9c --ctx-size 4096 --batch-size 512 --n-gpu-layers 27 --threads 16 --no-mmap --parallel 1 --port 49320"
ollama(60311) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
time=2025-05-18T03:31:57.543+02:00 level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-05-18T03:31:57.543+02:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-18T03:31:57.543+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-18T03:31:57.551+02:00 level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-05-18T03:31:57.551+02:00 level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:49320"
time=2025-05-18T03:31:57.585+02:00 level=INFO source=ggml.go:73 msg="" architecture=llama4 file_type=Q4_K_M name="" description="" num_tensors=1085 num_key_values=45
time=2025-05-18T03:31:57.587+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-05-18T03:31:57.684+02:00 level=INFO source=ggml.go:299 msg="model weights" buffer=Metal size="131.1 GiB"
time=2025-05-18T03:31:57.684+02:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="96.9 GiB"
time=2025-05-18T03:31:57.794+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-18T03:32:32.865+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-18T03:32:33.116+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-18T03:35:03.494+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-18T03:35:13.240+02:00 level=ERROR source=sched.go:478 msg="error loading llama server" error="llama runner process has terminated: signal: killed"
[GIN] 2025/05/18 - 03:35:13 | 500 |         3m15s |       127.0.0.1 | POST     "/api/generate"
```

And my MacOS system log is here:

[system log.txt](https://github.com/user-attachments/files/20270218/system.log.txt)

I hope that it can be fixed. If it works on Linux, it should be possible to make it work on macOS.


@rick-github commented on GitHub (May 18, 2025):

Does [this](https://github.com/ollama/ollama/issues/6918#issuecomment-2488651203) help?


@igorschlum commented on GitHub (May 18, 2025):

Hi @rick-github,

Thanks again for your previous insights. I've spent quite a bit more time trying to understand what's happening when running large models like llama4:maverick on macOS with Ollama.

My current understanding, after more observation and reading, points towards the macOS system itself killing the Ollama process. This seems to happen because Ollama (or rather, llama.cpp as invoked by Ollama) attempts to load all of the layers when a model is launched.

This brings to mind the answer you gave here https://github.com/ollama/ollama/issues/8571#issuecomment-2620274345: a user successfully ran a large GGUF model directly with llama.cpp by manually specifying a lower number of layers to offload (`--n-gpu-layers`). When they tried the same model with Ollama, which likely estimated a higher (and in that case excessive) layer count, it crashed with the `signal: killed` error.

The particular appeal of models like llama4:maverick for me was its Mixture of Experts (MoE) architecture. My hope was that this design would inherently allow it to run with a smaller active memory footprint, as only a subset of "experts" or layers would need to be active at any given time.

From what I gather, this might work better on your end because you're on Linux. As you mentioned, Linux seems more willing to allocate swap space and allow the process to run slowly, whereas macOS appears to be more aggressive in killing a process that requests a very large initial memory chunk that exceeds physical RAM significantly.

The remaining solution, as you pointed out, would be to create a custom Modelfile and manually limit num_gpu to load fewer layers. However, this does add a layer of complexity that might be a hurdle for a more "basic" macOS user. They'd ideally want ollama run model:tag to just work, even if it means the model runs a bit slower by not using all possible layers if RAM is constrained.
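
A minimal sketch of that Modelfile workaround, for reference; the derived model name and the layer count of 24 are illustrative guesses, not tuned values:

```shell
# Hypothetical workaround: derive a model variant that offloads
# fewer layers to the GPU (the value 24 is just an example).
cat > Modelfile <<'EOF'
FROM llama4:maverick
PARAMETER num_gpu 24
EOF
ollama create maverick-lowgpu -f Modelfile
ollama run maverick-lowgpu
```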

It would be fantastic if Ollama could, in the future, perhaps be more "aware" of MoE architectures or generally adopt a more conservative initial loading strategy, especially on macOS. For instance, it could try to load a minimal set of layers required for the model to function, and only attempt to load more if sufficient RAM is clearly available, rather than overestimating and leading to an immediate crash. The goal would be for Ollama to try and fit within available resources by default, particularly on systems known to be less tolerant of massive initial memory allocations.

Thanks for listening and for all the work you do on Ollama. These are just my observations as a user trying to make the most of these exciting new models on my Mac.

I'm joining @sunhy0316 in his request for simplicity https://github.com/ollama/ollama/issues/10631


@rick-github commented on GitHub (May 19, 2025):

> This brings to mind the answer you gave here [#8571 (comment)](https://github.com/ollama/ollama/issues/8571#issuecomment-2620274345): a user successfully ran a large GGUF model directly with llama.cpp by manually specifying a lower number of layers to offload (`--n-gpu-layers`). When they tried the same model with Ollama, which likely estimated a higher (and in that case excessive) layer count, it crashed with the `signal: killed` error.

The problem that macOS has is that `mmap` doesn't play well with Metal. In #8571 we established that the model loaded and ran fine by setting `num_gpu=0` and `use_mmap=true`. This is because the model is mapped into the runner address space rather than having the runner allocate RAM to hold the model weights. This obviously provides lower performance than using Metal to do partial inferencing, like Windows and Linux do with their GPUs. The problem for macOS is [this code](https://github.com/ollama/ollama/blob/94ab428e3f77fdd9d9c833b369bb40980c65049a/llm/server.go#L227), where `mmap` is disabled if there is a partial offload of the model. This forces the runner to try to allocate RAM for the model weights, and the kernel steps in and kills the runner because the amount of RAM+swap is insufficient. The comment for the code indicates a fundamental issue with Metal and offloading; I don't know the source of the issue.

Based on the code, you could try setting `num_gpu` to a value larger than the layer count of the model, enable `use_mmap`, and hope for the best. On Linux and Windows, this would result in the excess layers being shoved into shared memory, that is, in system RAM but processed by the GPU. In my experience this is slower than the hybrid GPU+CPU model, but it works. I don't have a Mac to try this on, so I can't speak to whether the model would run or crash with an OOM.

So the only real way forward for macOS and large models is to figure out why the dynamic swap doesn't seem to work. I'm knowledgeable about operating systems in general but not the specifics of macOS dynamic swap, which is why my advice is just simple scripts and links to other discussions trying to work around the problem.

> The particular appeal of models like llama4:maverick for me was its Mixture of Experts (MoE) architecture. My hope was that this design would inherently allow it to run with a smaller active memory footprint, as only a subset of "experts" or layers would need to be active at any given time.

Unfortunately the experts in an MoE are not a set of layers that can be switched into active memory. Each layer contains a slice of every expert, so an expert runs through the entire stack of layers rather than being confined to a subset of them. This is why all the layers are loaded. Even if that were the case, the selection of experts can change from one token to the next: the routing network chooses the set of experts to use for the current token generation cycle based on the current state of the token buffer, so it can change from one inference cycle to the next.

However, there is one use case where this might actually work. There are two ways to make model weights available for inference: load them into RAM, or map them into the process address space. When loading into RAM and there's not enough physical RAM to hold the weights, the weights spill into swap, typically a disk-based file. When mapping the weights into the address space, there's no RAM usage by the process, and the weights have to be read off disk. So in both cases, the bandwidth of the disk is a limiting factor on inference speed. If the inference is such that the same set of experts is chosen for each inference cycle, then the weights comprising those experts will end up in RAM (either paged in from swap, or read from disk and saved in the page cache). So it's possible for a performance increase to be seen for inferences that use the same set of experts.

> From what I gather, this might work better on your end because you're on Linux. As you mentioned, Linux seems more willing to allocate swap space and allow the process to run slowly, whereas macOS appears to be more aggressive in killing a process that requests a very large initial memory chunk that exceeds physical RAM significantly.

By default, Linux doesn't use dynamic swap: swap space is allocated when the system is built, and can be added to or removed manually later, but it's a relatively fixed amount of virtual space. The dynamic nature of macOS swap means that the process has to be suspended when the kernel detects that there's not enough swap and it has to extend the swap space. Why it kills the process rather than extending the swap is unknown to me.
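
To illustrate the contrast, on Linux swap is a fixed resource that is inspected and extended by hand; a sketch (the 64 GiB size is an arbitrary example):

```shell
# List active swap devices and files:
swapon --show

# Manually add a 64 GiB swap file (size chosen arbitrarily here):
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```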

> The remaining solution, as you pointed out, would be to create a custom Modelfile and manually limit `num_gpu` to load fewer layers. However, this does add a layer of complexity that might be a hurdle for a more "basic" macOS user. They'd ideally want `ollama run model:tag` to just work, even if it means the model runs a bit slower by not using all possible layers if RAM is constrained.
>
> It would be fantastic if Ollama could, in the future, perhaps be more "aware" of MoE architectures or generally adopt a more conservative initial loading strategy, especially on macOS. For instance, it could try to load a minimal set of layers required for the model to function, and only attempt to load more if sufficient RAM is clearly available, rather than overestimating and leading to an immediate crash. The goal would be for Ollama to try and fit within available resources by default, particularly on systems known to be less tolerant of massive initial memory allocations.

I agree that having the user tweak the Modelfile or API parameters to get a model to work is less than ideal. You've probably seen the many (many, many) times I've given advice on how to work around the memory estimation problems that ollama has. The goal of ollama is to make it easy to run models, and for the most part (excepting releases like 0.6.2 and 0.7.0) this has been the case. Now, however, models are getting more capable, so users want to run those models. Unfortunately they test the limits of consumer grade hardware, and combined with the estimation issues, it makes for a poor user experience. At the moment, users that want to try the bleeding edge of models will have to stock up on bandaids.

> Thanks for listening and for all the work you do on Ollama. These are just my observations as a user trying to make the most of these exciting new models on my Mac.
>
> I'm joining @sunhy0316 in his request for simplicity #10631

This is just another bandaid. The solution is to fix memory estimation. Part of 0.7.0 was rolled back today to alleviate the over-spilling that came with the release. More work on this (annoying) part of ollama is pending.

To the more specific issue you are having with llama4:

- try `num_gpu=0` and `use_mmap=true`
- try `num_gpu=50` and `use_mmap=true`
- use the [script](https://github.com/ollama/ollama/issues/6918#issuecomment-2488651203) with `OLLAMA_MEMORY=161061273600`, `num_gpu=0` and `use_mmap=false`
- use the [script](https://github.com/ollama/ollama/issues/6918#issuecomment-2488651203) with `OLLAMA_MEMORY=257698037760` and defaults for everything else
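
For anyone trying the first two suggestions without editing a Modelfile, the same options can also be passed per-request through the REST API; `num_gpu` and `use_mmap` are standard Ollama runtime options, though the model name here is just this thread's example:

```shell
# Pass the suggested options per-request via the API (a sketch):
curl http://localhost:11434/api/generate -d '{
  "model": "llama4:maverick",
  "prompt": "Hello",
  "options": { "num_gpu": 0, "use_mmap": true }
}'
```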

@rick-github commented on GitHub (May 20, 2025):

I just noticed that llama4 uses the new ollama engine, which doesn't support `mmap`. So loading the model requires enough RAM+swap.
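
If you want to confirm which engine loaded a model, the server log can be checked; on macOS it lives at `~/.ollama/logs/server.log` per the Ollama troubleshooting docs:

```shell
# Look for the engine start-up line in the server log (macOS path):
grep -i "ollama engine" ~/.ollama/logs/server.log
# e.g. msg="starting ollama engine"  (matches the log excerpt above)
```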


@igorschlum commented on GitHub (May 20, 2025):

Hi @rick-github, thank you for your long explanation. I hope that distributing layers across multiple computers over a local network will be a solution for running larger models with an existing set of computers.


@igorschlum commented on GitHub (May 24, 2025):

@srshkmr you can close the issue. On macOS, you need to have enough memory to load the model you want to run. On Linux and on Windows, you can swap the memory, but I think it's so slow that it's only good for testing some prompts.

Reference: github-starred/ollama#68988