[GH-ISSUE #1627] Can't run dolphin-mixtral, llama runner process has terminated #62941

Closed
opened 2026-05-03 10:55:02 -05:00 by GiteaMirror · 4 comments

Originally created by @mbruhler on GitHub (Dec 20, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1627

Hi,
I have trouble running dolphin-mixtral using Ollama.

When I type `ollama run dolphin-mixtral`, the message "llama runner process has terminated" appears.

This is the log:

llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/var/folders/tw/8r367x6x1t12cqcm0w9fhcgh0000gn/T/ollama2368397414/llama.cpp/gguf/build/metal/bin/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 319.22 MiB
llama_new_context_with_model: max tensor size =   102.55 MiB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 16384.00 MiB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  8935.19 MiB, offs =  17072324608, (25320.81 / 21845.34)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   512.03 MiB, (25832.84 / 21845.34)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   316.05 MiB, (26148.89 / 21845.34)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_graph_compute: command buffer 4 failed with status 5
GGML_ASSERT: /Users/jmorgan/workspace/ollama/llm/llama.cpp/gguf/ggml-metal.m:2353: false
2023/12/20 11:45:01 llama.go:451: signal: abort trap
2023/12/20 11:45:01 llama.go:459: error starting llama runner: llama runner process has terminated
2023/12/20 11:45:01 llama.go:525: llama runner stopped successfully
[GIN] 2023/12/20 - 11:45:01 | 500 | 25.934369375s |       127.0.0.1 | POST     "/api/generate"

I would greatly appreciate any help! :)

Thanks


@ghyghoo8 commented on GitHub (Dec 20, 2023):

mark


@mbruhler commented on GitHub (Dec 20, 2023):

This is probably due to a memory limitation, as this model is quite large. Smaller models run as expected. Can you confirm?
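
A quick way to sanity-check the memory theory is to compare the model's on-disk size against physical RAM (standard commands on macOS; the runtime footprint is larger still once the KV cache and compute buffers are added):

```
# macOS: model file size vs. total RAM
ollama list            # shows each pulled model's on-disk size
sysctl -n hw.memsize   # total physical RAM, in bytes
```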


@easp commented on GitHub (Dec 20, 2023):

@mbruhler

ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   512.03 MiB, (25832.84 / 21845.34)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   316.05 MiB, (26148.89 / 21845.34)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size

Yes, there isn't enough memory available to the GPU.

macOS limits the RAM the GPU can access in order to reserve RAM for the OS and other needs. In this case it's limiting the GPU to 21845.34 MiB (two-thirds of 32 GB; the same value `ggml_metal_init` reports as 22906.50 MB), while the running totals above show Ollama needs 26148.89 MiB. The subsequent `command buffer 4 failed with status 5` is Metal reporting an error status (`MTLCommandBufferStatusError`) once that over-commit bites.

You can use a tag for a smaller quantization, like q3_K_M, and/or [tell MacOS to allocate more RAM for GPU use.](https://techobsessed.net/2023/12/increasing-ram-available-to-gpu-on-apple-silicon-macs-for-running-large-language-models/)
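
For reference, a minimal sketch of both options (the quantization tag below is illustrative — check the model's actual tag list on ollama.com; `iogpu.wired_limit_mb` exists on macOS Sonoma and later, and the value does not persist across reboots):

```
# Option 1: run a smaller quantization (tag name illustrative; verify it exists)
ollama run dolphin-mixtral:8x7b-v2.5-q3_K_M

# Option 2: raise the GPU wired-memory cap (value in MiB)
sysctl iogpu.wired_limit_mb              # current setting (0 = macOS default, ~2/3 of RAM)
sudo sysctl iogpu.wired_limit_mb=26624   # allow ~26 GiB for GPU use
```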


@streichsbaer commented on GitHub (Dec 21, 2023):

> You can use a tag for a smaller quantization, like q3_K_M and/or [tell MacOS to allocate more RAM for GPU use.](https://techobsessed.net/2023/12/increasing-ram-available-to-gpu-on-apple-silicon-macs-for-running-large-language-models/)

Thanks, @easp!

Running `sudo sysctl iogpu.wired_limit_mb=26624` solved the problem for me on my M1 Pro (Sonoma 14.1.1).
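
Worth noting: a value set with `sysctl` resets on reboot. One common way to reapply it at boot is a LaunchDaemon — a minimal sketch, with the filename and label purely illustrative:

```
# Write a LaunchDaemon that re-applies the limit at boot (path/label illustrative)
sudo tee /Library/LaunchDaemons/com.example.iogpu-wired-limit.plist >/dev/null <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.iogpu-wired-limit</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/sbin/sysctl</string>
    <string>iogpu.wired_limit_mb=26624</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
EOF
sudo launchctl load /Library/LaunchDaemons/com.example.iogpu-wired-limit.plist
```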
