[GH-ISSUE #10532] LLAMA4 Scout crashes during run #68988

Closed
opened 2026-05-04 16:40:10 -05:00 by GiteaMirror · 10 comments

Originally created by @srshkmr on GitHub (May 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10532

### What is the issue?

```
ollama run llama4:17b-scout-16e-instruct-q4_K_M
Error: llama runner process has terminated: signal: killed
```

I am on an M3 Pro with 36 GB of RAM.

Do I have to wait for a different model?

### Relevant log output

```shell
```

### OS

macOS

### GPU

Apple

### CPU

Apple

### Ollama version

0.6.7

GiteaMirror added the bug label 2026-05-04 16:40:10 -05:00

@rick-github commented on GitHub (May 2, 2025):

`signal: killed` generally means that an external actor killed the runner. The likely cause is that the operating system thought that the runner was too big and would cause an OOM condition on the machine. There should be log entries from your OS in the system log.
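
For anyone looking for those entries, the snippet below sketches one way to query the macOS unified log. The exact message text varies by macOS version, so the search terms are only a starting point.

```shell
# Search the unified log for recent kill events (message text varies):
log show --last 30m --info --predicate 'eventMessage CONTAINS[c] "kill"'

# Memory-pressure kills usually come from the kernel's memorystatus
# (jetsam) mechanism, so this is also worth a look:
log show --last 30m --predicate 'eventMessage CONTAINS[c] "memorystatus"'
```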


@igorschlum commented on GitHub (May 12, 2025):

Hi, I'm adding my experience to this bug report as I'm encountering the same issue when trying to run llama4:maverick with Ollama 0.6.8.

My Environment:

- Ollama Version: 0.6.8
- Mac Station with 192 GB of RAM, macOS Sequoia 15.4.1

I thought that llama4 Maverick used only some of its layers at a time and so could run with 180 GB of RAM, despite the model size being 245 GB.


@rick-github commented on GitHub (May 12, 2025):

It should be able to run with 180 GB of RAM; there just has to be enough swap to accommodate the part of the model that is not RAM-resident. I'm not a Mac user, so I don't know if this is helpful: https://discussions.apple.com/thread/252429784
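
As a quick sanity check, macOS reports its current dynamic swap usage through the built-in `sysctl`:

```shell
# Report macOS dynamic swap usage:
sysctl vm.swapusage
# Example output:
# vm.swapusage: total = 2048.00M  used = 1027.25M  free = 1020.75M  (encrypted)
```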


@igorschlum commented on GitHub (May 18, 2025):

Hi @rick-github,

Thanks for your suggestion. I spent some time looking at how swap works on macOS.

Swap is normally managed automatically on macOS, and even after I freed up approximately 900 GB of space on my internal HD, I'm still encountering a kill message before the loading process completes.

Here is my log:

```shell
time=2025-05-18T03:31:57.503+02:00 level=INFO source=server.go:135 msg="system memory" total="192.0 GiB" free="221.5 GiB" free_swap="0 B"
time=2025-05-18T03:31:57.504+02:00 level=INFO source=server.go:168 msg=offload library=metal layers.requested=-1 layers.model=49 layers.offload=27 layers.split="" memory.available="[144.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="238.9 GiB" memory.required.partial="143.9 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[143.9 GiB]" memory.weights.total="225.8 GiB" memory.weights.repeating="225.1 GiB" memory.weights.nonrepeating="809.3 MiB" memory.graph.full="404.6 MiB" memory.graph.partial="404.6 MiB" projector.weights="1.6 GiB" projector.graph="0 B"
time=2025-05-18T03:31:57.541+02:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/igor/.ollama/models/blobs/sha256-ecdedd393ed15c5cd32bb4ae6240db958f600d757daa64aab531656964b13b9c --ctx-size 4096 --batch-size 512 --n-gpu-layers 27 --threads 16 --no-mmap --parallel 1 --port 49320"
ollama(60311) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
time=2025-05-18T03:31:57.543+02:00 level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-05-18T03:31:57.543+02:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-18T03:31:57.543+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-18T03:31:57.551+02:00 level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-05-18T03:31:57.551+02:00 level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:49320"
time=2025-05-18T03:31:57.585+02:00 level=INFO source=ggml.go:73 msg="" architecture=llama4 file_type=Q4_K_M name="" description="" num_tensors=1085 num_key_values=45
time=2025-05-18T03:31:57.587+02:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
time=2025-05-18T03:31:57.684+02:00 level=INFO source=ggml.go:299 msg="model weights" buffer=Metal size="131.1 GiB"
time=2025-05-18T03:31:57.684+02:00 level=INFO source=ggml.go:299 msg="model weights" buffer=CPU size="96.9 GiB"
time=2025-05-18T03:31:57.794+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-18T03:32:32.865+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-18T03:32:33.116+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
time=2025-05-18T03:35:03.494+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-18T03:35:13.240+02:00 level=ERROR source=sched.go:478 msg="error loading llama server" error="llama runner process has terminated: signal: killed"
[GIN] 2025/05/18 - 03:35:13 | 500 |         3m15s |       127.0.0.1 | POST     "/api/generate"
```

And my MacOS system log is here:

[system log.txt](https://github.com/user-attachments/files/20270218/system.log.txt)

I hope that it can be fixed. If it works on Linux, it should be possible to make it work on macOS.


@rick-github commented on GitHub (May 18, 2025):

Does [this](https://github.com/ollama/ollama/issues/6918#issuecomment-2488651203) help?


@igorschlum commented on GitHub (May 18, 2025):

Hi @rick-github,

Thanks again for your previous insights. I've spent quite a bit more time trying to understand what's happening when running large models like llama4:maverick on macOS with Ollama.

My current understanding, after more observation and reading, points towards the macOS system itself killing the Ollama process. This seems to happen because Ollama (or rather, llama.cpp as invoked by Ollama) attempts to load all of the layers when a model is launched.

This brings to mind the answer you gave here https://github.com/ollama/ollama/issues/8571#issuecomment-2620274345: a user successfully ran a large GGUF model directly with llama.cpp by manually specifying a lower number of layers to offload (`--n-gpu-layers`). When they tried the same model with Ollama, which likely estimated a higher (and in that case excessive) layer count, it crashed with the `signal: killed` error.

The particular appeal of models like llama4:maverick for me was its Mixture of Experts (MoE) architecture. My hope was that this design would inherently allow it to run with a smaller active memory footprint, as only a subset of "experts" or layers would need to be active at any given time.

From what I gather, this might work better on your end because you're on Linux. As you mentioned, Linux seems more willing to allocate swap space and allow the process to run slowly, whereas macOS appears to be more aggressive in killing a process that requests a very large initial memory chunk that exceeds physical RAM significantly.

The remaining solution, as you pointed out, would be to create a custom Modelfile and manually limit num_gpu to load fewer layers. However, this does add a layer of complexity that might be a hurdle for a more "basic" macOS user. They'd ideally want ollama run model:tag to just work, even if it means the model runs a bit slower by not using all possible layers if RAM is constrained.
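
A minimal sketch of that Modelfile workaround, for reference; the derived model name and the layer count of 24 are illustrative guesses, not tuned values:

```shell
# Hypothetical workaround: derive a model variant that offloads
# fewer layers to the GPU (the value 24 is just an example).
cat > Modelfile <<'EOF'
FROM llama4:maverick
PARAMETER num_gpu 24
EOF
ollama create maverick-lowgpu -f Modelfile
ollama run maverick-lowgpu
```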

It would be fantastic if Ollama could, in the future, perhaps be more "aware" of MoE architectures or generally adopt a more conservative initial loading strategy, especially on macOS. For instance, it could try to load a minimal set of layers required for the model to function, and only attempt to load more if sufficient RAM is clearly available, rather than overestimating and leading to an immediate crash. The goal would be for Ollama to try and fit within available resources by default, particularly on systems known to be less tolerant of massive initial memory allocations.

Thanks for listening and for all the work you do on Ollama. These are just my observations as a user trying to make the most of these exciting new models on my Mac.

I'm joining @sunhy0316 in his request for simplicity https://github.com/ollama/ollama/issues/10631


@rick-github commented on GitHub (May 19, 2025):

> This brings to mind the answer you gave here [#8571 (comment)](https://github.com/ollama/ollama/issues/8571#issuecomment-2620274345): a user successfully ran a large GGUF model directly with llama.cpp by manually specifying a lower number of layers to offload (`--n-gpu-layers`). When they tried the same model with Ollama, which likely estimated a higher (and in that case excessive) layer count, it crashed with the `signal: killed` error.

The problem that macOS has is that `mmap` doesn't play well with Metal. In #8571 we established that the model loaded and ran fine by setting `num_gpu=0` and `use_mmap=true`. This is because the model is mapped into the runner address space rather than having the runner allocate RAM to hold the model weights. This obviously provides lower performance than using Metal to do partial inferencing, like Windows and Linux do with their GPUs. The problem for macOS is [this code](https://github.com/ollama/ollama/blob/94ab428e3f77fdd9d9c833b369bb40980c65049a/llm/server.go#L227), where `mmap` is disabled if there is a partial offload of the model. This forces the runner to try to allocate RAM for the model weights, and the kernel steps in and kills the runner because the amount of RAM+swap is insufficient. The comment for the code indicates a fundamental issue with Metal and offloading; I don't know the source of the issue.

Based on the code, you could try setting `num_gpu` to a value larger than the layer count of the model, enable `use_mmap`, and hope for the best. On Linux and Windows, this would result in the excess layers being shoved into shared memory, that is, in system RAM but processed by the GPU. In my experience this is slower than the hybrid GPU+CPU model, but it works. I don't have a Mac to try this on, so I can't speak to whether the model would run or crash with an OOM.

So the only real way forward for macOS and large models is to figure out why the dynamic swap doesn't seem to work. I'm knowledgeable about operating systems in general but not the specifics of macOS dynamic swap, which is why my advice is just simple scripts and links to other discussions trying to work around the problem.

> The particular appeal of models like llama4:maverick for me was its Mixture of Experts (MoE) architecture. My hope was that this design would inherently allow it to run with a smaller active memory footprint, as only a subset of "experts" or layers would need to be active at any given time.

Unfortunately the experts in an MoE are not a set of layers that can be switched into active memory. Each layer contains a slice of every expert, so an expert runs through the entire stack of layers rather than being confined to a subset of them. This is why all the layers are loaded. Even if that were the case, the selection of experts can change from one token to the next: the routing network chooses the set of experts to use for the current token generation cycle based on the current state of the token buffer, so it can change from one inference cycle to the next.

However, there is one use case where this might actually work. There are two ways to make model weights available for inference: load them into RAM, or map them into the process address space. When loading into RAM and there's not enough physical RAM to hold the weights, the weights spill into swap, typically a disk-based file. When mapping the weights into the address space, there's no RAM usage by the process, and the weights have to be read off disk. So in both cases, the bandwidth of the disk is a limiting factor on inference speed. If the inference is such that the same set of experts is chosen for each inference cycle, then the weights comprising those experts will end up in RAM (either paged in from swap, or read from disk and saved in the page cache). So it's possible for a performance increase to be seen for inferences that use the same set of experts.

> From what I gather, this might work better on your end because you're on Linux. As you mentioned, Linux seems more willing to allocate swap space and allow the process to run slowly, whereas macOS appears to be more aggressive in killing a process that requests a very large initial memory chunk that exceeds physical RAM significantly.

By default, Linux doesn't use dynamic swap: swap space is allocated when the system is built, and can be added to or removed manually later, but it's a relatively fixed amount of virtual space. The dynamic nature of macOS swap means that the process has to be suspended when the kernel detects that there's not enough swap and it has to extend the swap space. Why it kills the process rather than extending the swap is unknown to me.
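
To illustrate the contrast, on Linux swap is a fixed resource that is inspected and extended by hand; a sketch (the 64 GiB size is an arbitrary example):

```shell
# List active swap devices and files:
swapon --show

# Manually add a 64 GiB swap file (size chosen arbitrarily here):
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```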

> The remaining solution, as you pointed out, would be to create a custom Modelfile and manually limit `num_gpu` to load fewer layers. However, this does add a layer of complexity that might be a hurdle for a more "basic" macOS user. They'd ideally want `ollama run model:tag` to just work, even if it means the model runs a bit slower by not using all possible layers if RAM is constrained.
>
> It would be fantastic if Ollama could, in the future, perhaps be more "aware" of MoE architectures or generally adopt a more conservative initial loading strategy, especially on macOS. For instance, it could try to load a minimal set of layers required for the model to function, and only attempt to load more if sufficient RAM is clearly available, rather than overestimating and leading to an immediate crash. The goal would be for Ollama to try and fit within available resources by default, particularly on systems known to be less tolerant of massive initial memory allocations.

I agree that having the user tweak the Modelfile or API parameters to get a model to work is less than ideal. You've probably seen the many (many, many) times I've given advice on how to work around the memory estimation problems that ollama has. The goal of ollama is to make it easy to run models, and for the most part (excepting releases like 0.6.2 and 0.7.0) this has been the case. Now, however, models are getting more capable, so users want to run those models. Unfortunately they test the limits of consumer grade hardware, and combined with the estimation issues, it makes for a poor user experience. At the moment, users that want to try the bleeding edge of models will have to stock up on bandaids.

> Thanks for listening and for all the work you do on Ollama. These are just my observations as a user trying to make the most of these exciting new models on my Mac.
>
> I'm joining @sunhy0316 in his request for simplicity #10631

This is just another bandaid. The solution is to fix memory estimation. Part of 0.7.0 was rolled back today to alleviate the over-spilling that came with the release. More work on this (annoying) part of ollama is pending.

To the more specific issue you are having with llama4:

- try `num_gpu=0` and `use_mmap=true`
- try `num_gpu=50` and `use_mmap=true`
- use the [script](https://github.com/ollama/ollama/issues/6918#issuecomment-2488651203) with `OLLAMA_MEMORY=161061273600`, `num_gpu=0` and `use_mmap=false`
- use the [script](https://github.com/ollama/ollama/issues/6918#issuecomment-2488651203) with `OLLAMA_MEMORY=257698037760` and defaults for everything else
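
For anyone trying the first two suggestions without editing a Modelfile, the same options can also be passed per-request through the REST API; `num_gpu` and `use_mmap` are standard Ollama runtime options, though the model name here is just this thread's example:

```shell
# Pass the suggested options per-request via the API (a sketch):
curl http://localhost:11434/api/generate -d '{
  "model": "llama4:maverick",
  "prompt": "Hello",
  "options": { "num_gpu": 0, "use_mmap": true }
}'
```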

@rick-github commented on GitHub (May 20, 2025):

I just noticed that llama4 uses the new ollama engine, which doesn't support `mmap`. So loading the model requires enough RAM+swap.
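
If you want to confirm which engine loaded a model, the server log can be checked; on macOS it lives at `~/.ollama/logs/server.log` per the Ollama troubleshooting docs:

```shell
# Look for the engine start-up line in the server log (macOS path):
grep -i "ollama engine" ~/.ollama/logs/server.log
# e.g. msg="starting ollama engine"  (matches the log excerpt above)
```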


@igorschlum commented on GitHub (May 20, 2025):

Hi @rick-github, thank you for your long explanation. I hope that distributing layers across multiple computers over a local network will be a solution for running larger models with an existing set of computers.


@igorschlum commented on GitHub (May 24, 2025):

@srshkmr you can close the issue. On macOS, you need to have enough memory to load the model you want to run. On Linux and on Windows, you can swap the memory, but I think it's so slow that it's only good for testing some prompts.

Reference: github-starred/ollama#68988