[GH-ISSUE #1016] Support AMD GPUs on Intel Macs #26257

Open
opened 2026-04-22 02:22:33 -05:00 by GiteaMirror · 171 comments

Originally created by @J0hnny007 on GitHub (Nov 6, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1016

Originally assigned to: @dhiltgen on GitHub.

I'm currently trying out the ollama app on my iMac (i7/Vega64) and I can't seem to get it to use my GPU.

I have tried running it with num_gpu 1 but that generated the warnings below.

```
2023/11/06 16:06:33 llama.go:384: starting llama runner
2023/11/06 16:06:33 llama.go:386: error starting the external llama runner: fork/exec /var/folders/2z/r_0t221x2blbq02n5dp2m5fr0000gn/T/ollama1975281143/llama.cpp/gguf/build/metal/bin/ollama-runner: bad CPU type in executable
2023/11/06 16:06:33 llama.go:384: starting llama runner
2023/11/06 16:06:33 llama.go:442: waiting for llama runner to start responding
{"timestamp":1699283193,"level":"WARNING","function":"server_params_parse","line":873,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
{"timestamp":1699283193,"level":"INFO","function":"main","line":1324,"message":"build info","build":219,"commit":"9e70cc0"}
{"timestamp":1699283193,"level":"INFO","function":"main","line":1330,"message":"system info","n_threads":6,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}
```
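
For reference, here is a minimal sketch of the kind of `num_gpu` override described above; the base model name is just a placeholder and the commands assume a stock Ollama install:

```
# Hypothetical example: ask Ollama to offload at least one layer to the GPU.
# "llama2" is only a placeholder model name.
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER num_gpu 1
EOF
ollama create gpu-test -f Modelfile
ollama run gpu-test
```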

GiteaMirror added the feature request, amd, macos labels 2026-04-22 02:22:34 -05:00

@BruceMacD commented on GitHub (Nov 7, 2023):

Hi @J0hnny007, thanks for opening the issue. Ollama only supports the Metal GPU API on Macs right now. AMD GPUs won't work.

@J0hnny007 commented on GitHub (Nov 7, 2023):

Good to know, though I thought that mps can use AMD GPUs. Oh well, thanks for the info.

@cmarhoover commented on GitHub (Dec 13, 2023):

Apple's "Metal Overview" page has the following hardware support list in the page footer:

> Metal 3 is supported on the following hardware:
> iPhone and iPad: Apple A13 Bionic or later
> Mac: Apple silicon (M1 or later), AMD Radeon Pro Vega series, AMD Radeon Pro 5000/6000 series, Intel Iris Plus Graphics series, Intel UHD Graphics 630

Despite being listed as supporting Metal 3, I can confirm that Ollama does not currently use the Radeon RX 6900 in my Mac Pro system.

@Basten7 commented on GitHub (Dec 16, 2023):

Me too. I can confirm that Ollama does not use the Radeon RX 6800X on my Mac Pro even when `PARAMETER num_gpu 1` is set in the Modelfile.

@cracksauce commented on GitHub (Dec 20, 2023):

Are there any plans for Ollama to support this type of hardware setup (AMD GPUs on Intel Mac)?

@ucodia commented on GitHub (Jan 1, 2024):

Intel Macs with an AMD graphics card do have support for Metal 3, as the screenshot below attests.

[screenshot showing Metal 3 support on the machine]

Though as previously reported, Ollama does not seem to be able to leverage the AMD GPU despite the API support being there on macOS.

@J0hnny007 Could we please reopen this issue, as it was closed on the assumption that AMD GPUs were not compatible with Metal?

@pjv commented on GitHub (Jan 2, 2024):

Some possibly relevant data: on my Intel iMac Pro with an AMD Radeon Pro Vega (8 GB VRAM), if I build the current head of llama.cpp with `make CUBLAS=1`, the resulting `main` binary will run models with the GPU.

@cracksauce commented on GitHub (Jan 2, 2024):

> Some possibly relevant data: on my Intel iMac Pro with an AMD Radeon Pro Vega (8 GB VRAM), if I build the current head of llama.cpp with `make CUBLAS=1`, the resulting `main` binary will run models with the GPU.

Could you describe how to do that for those of us who are less technical? Would appreciate it, thanks!

@pjv commented on GitHub (Jan 3, 2024):

@cracksauce my report wasn't a how-to fix for Ollama. It was a pointer for the Ollama developers: it might let them tweak how they build one of Ollama's dependencies in a way that could allow Ollama to make use of AMD GPUs on Intel Macs.

If you are interested in building and running llama.cpp directly, you should check out that project's repo (https://github.com/ggerganov/llama.cpp).
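
That said, a rough sketch of the basic steps (the model path is a placeholder, and the exact make flags have changed across llama.cpp versions, so treat this as illustrative rather than exact):

```
# Hypothetical walkthrough of building llama.cpp and testing GPU offload on macOS.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_METAL=1                 # build with the Metal backend enabled
./main -m /path/to/model.gguf -p "Hello" --n-gpu-layers 999   # offload all layers
```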

@leobenkel commented on GitHub (Jan 3, 2024):

Hello,
I have a macOS machine with these specs:

AMD Radeon Pro 5500M 8 GB
Intel UHD Graphics 630 1536 MB

Has anyone been able to find a way to run the Ollama Docker image with GPU acceleration? I have not found a tutorial that works. I tried following the NVIDIA one, which obviously did not work.

@cracksauce commented on GitHub (Jan 7, 2024):

@leobenkel Seems like there might be an adjustment the devs can make to one of the Ollama dependency builds to take advantage of AMD GPUs' Metal 3 support on Intel Macs. TBD I suppose!

cc @J0hnny007 @BruceMacD

Some other possible fixes and random tweaks after perusing the llama.cpp repo:
https://github.com/ggerganov/llama.cpp/issues/2965#issuecomment-1763223051
https://github.com/ggerganov/llama.cpp/issues/3000
https://github.com/ggerganov/llama.cpp/issues/3129#issuecomment-1848436692
https://github.com/ggerganov/llama.cpp/pull/1435#issuecomment-1546928978
https://github.com/ggerganov/llama.cpp/issues/1429#issuecomment-1805455807

@leobenkel commented on GitHub (Jan 12, 2024):

Thank you @cracksauce , that would be great ! :)

@dhiltgen commented on GitHub (Jan 15, 2024):

PR #2007, once merged, likely provides a foundation upon which we could support this.

Much like the gen_linux.sh script (https://github.com/jmorganca/ollama/blob/main/llm/generate/gen_linux.sh), we could augment the gen_darwin.sh script (https://github.com/jmorganca/ollama/blob/main/llm/generate/gen_darwin.sh) in the x86 case to look for the underlying GPU libraries on the build system and, if detected, build a variant of llama.cpp with the appropriate flags. The detection logic (https://github.com/jmorganca/ollama/tree/main/gpu) would likely need some adjustments as well for Intel Macs.
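
Purely as an illustration of the shape of that change (this is not the actual contents of gen_darwin.sh; the detection command and CMake flag are just one plausible way to do it):

```
# Hypothetical sketch of an x86_64 branch in the generate script: if the build
# machine has a discrete AMD GPU, also build a Metal-enabled runner variant.
# (Assumes the working directory is the vendored llama.cpp source.)
if [ "$(uname -m)" = "x86_64" ]; then
    if system_profiler SPDisplaysDataType | grep -q "AMD Radeon"; then
        echo "Discrete AMD GPU detected; building Metal variant"
        cmake -B build/metal_amd -DLLAMA_METAL=on
        cmake --build build/metal_amd --config Release
    fi
fi
```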

@birchcode commented on GitHub (Feb 16, 2024):

Would love this. Running a 6900 XT here.

Any way we can help?

> PR #2007, once merged, likely provides a foundation upon which we could support this.
>
> Much like the gen_linux.sh script, we could augment the gen_darwin.sh script in the x86 case to look for the underlying GPU libraries on the build system and, if detected, build a variant of llama.cpp with the appropriate flags. The detection logic would likely need some adjustments as well for Intel Macs.

@dhiltgen commented on GitHub (Feb 16, 2024):

> Any way we can help?

The biggest unknown in my mind is the viability of the underlying GPU libraries (CUDA/ROCm) on Intel macOS. When Apple released the M-series with integrated GPUs, they alienated both AMD and NVIDIA, so neither company is going to support their libraries going forward on Intel Macs. So really the question is: what was the last supported version, and is that version viable for building llama.cpp? So I think the answer to your question is to try to get upstream llama.cpp to build on your Intel Mac with the last supported version of ROCm and leverage your Radeon GPU. If that works, then my guidance above on the build scripts would apply to wiring that into our build process.

I'm not sure we'd integrate this into our official builds given the sunsetting nature of this compatibility matrix, but I think we'd be open to improvements to the build scripts so that people can build from source on Intel Macs and get GPU acceleration.

@birchcode commented on GitHub (Feb 18, 2024):

> > Any way we can help?
>
> The biggest unknown in my mind is the viability of the underlying GPU libraries (CUDA/ROCm) on Intel macOS. When Apple released the M-series with integrated GPUs, they alienated both AMD and NVIDIA, so neither company is going to support their libraries going forward on Intel Macs. So really the question is: what was the last supported version, and is that version viable for building llama.cpp? So I think the answer to your question is to try to get upstream llama.cpp to build on your Intel Mac with the last supported version of ROCm and leverage your Radeon GPU. If that works, then my guidance above on the build scripts would apply to wiring that into our build process.

The alienation explains some things. Yes, I think ROCm has never been supported on Apple; I would have to boot into Linux (my next option) to use that. But we do have Metal, though I'm not sure what the mileage will be.

I was able to build llama.cpp with `make CUBLAS=1` running macOS 11.6, with Metal Family: Supported, Metal GPUFamily macOS 2.

@dhiltgen commented on GitHub (Feb 19, 2024):

@birchcode that sounds like a good step. What sort of performance are you able to achieve, and does it look promising?

Using the Metal API on Intel Macs for these other GPUs may complicate our memory detection and layer calculations. Somehow we'd need to refine https://github.com/ollama/ollama/blob/main/gpu/gpu_darwin.go to retrieve the GPU memory from some Metal API, and then use an algorithm similar to the CUDA/ROCm version (https://github.com/ollama/ollama/blob/main/gpu/gpu.go#L244-L259).
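
As a rough illustration of the information available on an Intel Mac (the real change would query something like Metal's recommendedMaxWorkingSetSize from gpu_darwin.go; this is just the command-line view of similar data):

```
# Hypothetical check: list GPU model, VRAM, and Metal support on macOS.
# Discrete AMD cards report a separate VRAM figure rather than unified memory.
system_profiler SPDisplaysDataType | grep -E "Chipset Model|VRAM|Metal"
```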

@birchcode commented on GitHub (Feb 20, 2024):

@dhiltgen I went to run it but had no models. I've previously downloaded some models using the GUI; is it possible to reuse them somehow rather than download new ones? I'm not on the fastest connection right now.

I downloaded deepseek-ai/deepseek-coder-6.7b-instruct and followed the guide to convert it, but had some issues with that.

@dhiltgen commented on GitHub (Feb 20, 2024):

A little trick: if you run the `ollama serve` command and load up a model, you can see the file path of the model in the server log output, and then you can use that file with the llama.cpp server executable.

Example:

```
% ollama serve
...
llama_model_loader: loaded meta data with 19 key-value pairs and 237 tensors from /Users/daniel/.ollama/models/blobs/sha256:66002b78c70a22ab25e16cc9a1736c6cc6335398c7312e3eb33db202350afe66 (version GGUF V2)
...
```

Then in your llama.cpp repo, after building the server

```
% ./build/bin/server -m /Users/daniel/.ollama/models/blobs/sha256:66002b78c70a22ab25e16cc9a1736c6cc6335398c7312e3eb33db202350afe66 -c 2048 --n-gpu-layers 999
...
```

@birchcode commented on GitHub (Feb 21, 2024):

That helped. Moving a little closer...

{"timestamp":1708486671,"level":"INFO","function":"main","line":2544,"message":"system info","n_threads":12,"n_threads_batch":-1,"total_threads":24,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "}

llama server listening at http://127.0.0.1:8080

{"timestamp":1708486671,"level":"INFO","function":"main","line":2643,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/rmp/.ollama/models/blobs/sha256:3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 264/32016 vs 259/32016 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = codellama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  3577.61 MiB, ( 3577.61 / 16368.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      Metal buffer size =  3577.61 MiB
llm_load_tensors:        CPU buffer size =    70.35 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon RX 6900 XT
ggml_metal_init: picking default device: AMD Radeon RX 6900 XT
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/rmp/projects/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   AMD Radeon RX 6900 XT
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support   = false
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  = 17163.09 MB
ggml_metal_init: skipping kernel_soft_max                  (not supported)
ggml_metal_init: skipping kernel_soft_max_4                (not supported)
ggml_metal_init: skipping kernel_rms_norm                  (not supported)
ggml_metal_init: skipping kernel_group_norm                (not supported)
ggml_metal_init: skipping kernel_mul_mv_f32_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f16            (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f32_1row       (not supported)
ggml_metal_init: skipping kernel_mul_mv_f16_f32_l4         (not supported)
ggml_metal_init: skipping kernel_mul_mv_q4_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q4_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q5_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q5_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q8_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q2_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q3_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q4_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q5_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_q6_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq2_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq2_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mv_iq3_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_f32_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_f16_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q4_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q4_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q5_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q5_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q8_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q2_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q3_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q4_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q5_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_q6_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq2_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq2_xs_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_iq3_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_f32_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32     (not supported)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  1024.00 MiB, ( 4626.27 / 16368.00)
llama_kv_cache_init:      Metal KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    13.02 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   164.00 MiB, ( 4790.27 / 16368.00)
llama_new_context_with_model:      Metal compute buffer size =   164.00 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
ggml_metal_graph_compute_block_invoke: error: unsupported op 'MUL_MAT'
GGML_ASSERT: ggml-metal.m:769: !"unsupported op"
ggml_metal_graph_compute_block_invoke: error: unsupported op 'MUL_MAT'
ggml_metal_graph_compute_block_invoke: error: unsupported op 'RMS_NORM'
GGML_ASSERT: ggml-metal.m:769: !"unsupported op"
GGML_ASSERT: ggml-metal.m:769: !"unsupported op"
ggml_metal_graph_compute_block_invoke: error: unsupported op 'MUL_MAT'
GGML_ASSERT: ggml-metal.m:769: !"unsupported op"
[1]    28814 abort      ./server -m  -c 2048 --n-gpu-layers 999
```

@FellowTraveler commented on GitHub (Mar 12, 2024):

Hey, I use an Intel Mac with an AMD Radeon Pro 5500M, which supports the Metal 3 API, but I'm having trouble getting Ollama to work. Let me know when you fix this, because there are a million more like me.
*EDIT: Apparently using the Metal 3 API is no good because it's optimized for Apple silicon rather than AMD GPUs, so you have to use Vulkan or ROCm or who knows what.

@cracksauce commented on GitHub (Apr 18, 2024):

Any updates on this?

@xakrume commented on GitHub (Apr 20, 2024):

llama3, run with llama.cpp compiled from source on my Mac Pro with an AMD Radeon RX 6950 XT 16 GB, successfully utilized the GPU.

(base) ➜  bin git:(master) ./main -m models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 -i
Log start
main: build = 2685 (8cc91dc6)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for x86_64-apple-darwin23.4.0
main: seed  = 1713629542
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/rf/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.22 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  3584.00 MiB, offs =            0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   982.98 MiB, offs =   3327152128, ( 4566.98 / 16368.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      Metal buffer size =  4155.99 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon HD GFX10 Family Unknown Prototype
ggml_metal_init: picking default device: AMD Radeon HD GFX10 Family Unknown Prototype
ggml_metal_init: loading '/Volumes/256-A/llama/llama.cpp/build/bin/default.metallib'
ggml_metal_init: GPU name:   AMD Radeon HD GFX10 Family Unknown Prototype
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  = 17163.09 MB
ggml_metal_init: skipping kernel_mul_mm_f32_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32      (not supported)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    64.00 MiB, ( 4664.62 / 16368.00)
llama_kv_cache_init:      Metal KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   258.50 MiB, ( 4923.12 / 16368.00)
llama_new_context_with_model:      Metal compute buffer size =   258.50 MiB
llama_new_context_with_model:        CPU compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
main: interactive mode on.
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

Question:
Let m be 2/6 - 2/2. Let v = 0.06 + 0.24. Which is the closest to m?  (a) v  (b) 1
Answer:
a) v is the closest to m. It is approximately 0.3. For m to be this close, b) 1 must be greater than or equal to 2. This is not the case. So, m is the closest. This is incorrect. It seems that the correct answer is not provided. It is the closest of the given options. The question should be reworded to provide more options. For instance, options like 0.1, 0.05, 0.01, etc. could be added to provide more accuracy. For the given options, a) v is the closest to m. It is approximately 0.3. For m to be this close, b) 1 must be greater than or equal to 2. This is not the case. So, m is the closest. This is incorrect. It seems that the correct answer is not provided. It is the closest of the given options. The question should be reworded to provide more options. For instance, options like 0.1, 0.05, 0.01, etc. could be added to provide more accuracy. For the given options, a) v is the closest to m. It is approximately 0.3. For m to be this close, b) 1 must be greater than or equal to 2. This is not the case. So, m is the closest. This is incorrect. It seems that the correct answer is not provided. It is the closest of the given options. The question should be reworded to provide more options. For instance, options like 0.1, 0.05, 0.01, etc. could be added to provide more accuracy. For the given options, a) v is the closest to m. It is approximately 0.3. For m to be this close, b) 1 must be greater than or equal to 2. This is not the case. So, m is the closest. This is incorrect. It seems that the correct answer is not provided. It is the closest of the given options. The question should be reworded to provide more options. For instance, options like 0.1, 0.05, 0.01, etc. could be added to provide more accuracy. 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5

![Screenshot 2024-04-20 at 7 21 23 PM](https://github.com/ollama/ollama/assets/923816/a20fa5e8-1b65-4474-988b-72f52f0097da)

![Screenshot 2024-04-20 at 7 20 34 PM](https://github.com/ollama/ollama/assets/923816/912cbaa3-3e80-4013-84e0-25a30aba8340)
![Screenshot 2024-04-20 at 7 20 30 PM](https://github.com/ollama/ollama/assets/923816/da43932e-ea9e-44ae-92b6-45211f6de61f)

![photo_2024-04-20_19-25-00](https://github.com/ollama/ollama/assets/923816/dca41e05-1c75-4a97-833a-5c36d94452a6)

@xakrume commented on GitHub (Apr 20, 2024):

llama3 build options

zig build -Doptimize=ReleaseFast -Dtarget=native -Dcpu=native

~/llama.cpp/build/bin/main --version
version: 2685 (8cc91dc6)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for x86_64-apple-darwin23.4.0

@renanwilliam commented on GitHub (Apr 24, 2024):

I have an Intel Mac with an i9, a Radeon Pro Vega 20 4 GB GPU, and 32 GB RAM. Running Ollama is incredibly slow and almost unusable, unfortunately.

@cracksauce commented on GitHub (Apr 26, 2024):

> llama3 compiled from sources on my MacPro with AMD Radeon RX 6950 XT 16GB successfully utilized GPU.

@xakrume Can you explain in more detail how you did this? Did you build llama.cpp with zig? What commands did you run? Did you make any changes to how you use ollama?

@dhiltgen commented on GitHub (Apr 26, 2024):

@xakrume if you're up for it, could you post a PR to update the x86 darwin build to add a metal variant?

The build portion would be added around [here](https://github.com/ollama/ollama/blob/main/llm/generate/gen_darwin.sh#L71)

and we'd need some adjustments to the GPU discovery logic to be able to identify when to use this variant. At present it's simplistic and just toggles CPU variants on x86, and always uses metal on arm but I think we'd need to actually discover on x86 if there is a metal GPU present. https://github.com/ollama/ollama/blob/main/gpu/gpu_darwin.go and https://github.com/ollama/ollama/blob/main/gpu/gpu_info_darwin.m
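
For illustration, here is a minimal sketch of that detection step: ask Metal for all visible devices and report whether any of them is a discrete (non-low-power) GPU, such as an AMD card in an Intel Mac. The package layout and names (`HasDiscreteMetalGPU`) are assumptions for the sketch, not ollama's actual code in gpu_darwin.go / gpu_info_darwin.m.

```go
// Sketch only: detect a discrete Metal-capable GPU on darwin/amd64.
// discreteMetalDeviceCount / HasDiscreteMetalGPU are illustrative names,
// not symbols from gpu/gpu_darwin.go or gpu/gpu_info_darwin.m.
package gpu

/*
#cgo CFLAGS: -x objective-c
#cgo LDFLAGS: -framework Metal -framework Foundation
#import <Metal/Metal.h>

// Count Metal devices that are not integrated/low-power GPUs
// (e.g. an AMD card in an Intel Mac).
static int discreteMetalDeviceCount(void) {
	int count = 0;
	NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices();
	for (id<MTLDevice> device in devices) {
		if (![device isLowPower]) {
			count++;
		}
	}
	return count;
}
*/
import "C"

// HasDiscreteMetalGPU reports whether macOS exposes a discrete Metal GPU,
// which is when a metal runner variant would be worth selecting on x86.
func HasDiscreteMetalGPU() bool {
	return int(C.discreteMetalDeviceCount()) > 0
}
```

On Apple silicon such a check is moot, since Metal is always available; its value is on Intel Macs, where the CPU-only runner is otherwise chosen.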

@xakrume commented on GitHub (Apr 26, 2024):

@cracksauce, sorry, my mistake. No need to run zig.
I'm using MacPorts packages:

sudo port install cmake
# add llvm libs symlinks
sudo ln -sf /opt/local/libexec/llvm-16/lib/libc++.1.dylib /opt/local/lib/
sudo ln -sf /opt/local/libexec/llvm-16/lib/libc++abi.1.dylib /opt/local/lib/

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release

The binary files are located in bin/ inside your build directory.

CPU features from `zig targets`:

(base)   llama.cpp git:(master) zig targets | jq .native
{
  "triple": "x86_64-macos.14.4.1...14.4.1-none",
  "cpu": {
    "arch": "x86_64",
    "name": "skylake",
    "features": [
      "64bit",
      "adx",
      "aes",
      "allow_light_256_bit",
      "avx",
      "avx2",
      "bmi",
      "bmi2",
      "clflushopt",
      "cmov",
      "crc32",
      "cx16",
      "cx8",
      "ermsb",
      "f16c",
      "false_deps_popcnt",
      "fast_15bytenop",
      "fast_gather",
      "fast_scalar_fsqrt",
      "fast_shld_rotate",
      "fast_variable_crosslane_shuffle",
      "fast_variable_perlane_shuffle",
      "fast_vector_fsqrt",
      "fma",
      "fsgsbase",
      "fxsr",
      "idivq_to_divl",
      "invpcid",
      "lzcnt",
      "macrofusion",
      "mmx",
      "movbe",
      "no_bypass_delay_blend",
      "no_bypass_delay_mov",
      "no_bypass_delay_shuffle",
      "nopl",
      "pclmul",
      "popcnt",
      "prfchw",
      "rdrnd",
      "rdseed",
      "sahf",
      "sgx",
      "slow_3ops_lea",
      "sse",
      "sse2",
      "sse3",
      "sse4_1",
      "sse4_2",
      "ssse3",
      "vzeroupper",
      "x87",
      "xsave",
      "xsavec",
      "xsaveopt",
      "xsaves"
    ]
  },
  "os": "macos",
  "abi": "none"
}

CPU Intel(R) Core(TM) i9-9900K info:

(base) ➜  llama.cpp git:(master) sysctl -a | grep machdep.cpu
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.sub_Cstates: 286531872
machdep.cpu.thermal.sensor: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.xsave.extended_state: 31 832 1088 0
machdep.cpu.xsave.extended_state1: 15 832 256 0
machdep.cpu.arch_perf.version: 4
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.width: 48
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.fixed_number: 3
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.L2_associativity: 4
machdep.cpu.cache.size: 256
machdep.cpu.tlb.inst.large: 8
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.data.small_level1: 64
machdep.cpu.address_bits.physical: 39
machdep.cpu.address_bits.virtual: 48
machdep.cpu.tsc_ccc.numerator: 300
machdep.cpu.tsc_ccc.denominator: 2
machdep.cpu.max_basic: 22
machdep.cpu.max_ext: 2147483656
machdep.cpu.vendor: GenuineIntel
machdep.cpu.brand_string: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
machdep.cpu.family: 6
machdep.cpu.model: 158
machdep.cpu.extmodel: 9
machdep.cpu.extfamily: 0
machdep.cpu.stepping: 13
machdep.cpu.feature_bits: 9221960262849657855
machdep.cpu.leaf7_feature_bits: 43804591 1073741824
machdep.cpu.leaf7_feature_bits_edx: 3154120192
machdep.cpu.extfeature_bits: 1241984796928
machdep.cpu.signature: 591597
machdep.cpu.brand: 0
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 AVX2 SMEP BMI2 ERMS INVPCID FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT SGXLC MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI
machdep.cpu.logical_per_package: 16
machdep.cpu.cores_per_package: 8
machdep.cpu.microcode_version: 252
machdep.cpu.processor_flag: 1
machdep.cpu.core_count: 8
machdep.cpu.thread_count: 16

@xakrume commented on GitHub (Apr 27, 2024):

> @xakrume if you're up for it, could you post a PR to update the x86 darwin build to add a metal variant?
>
> The build portion would be added around [here](https://github.com/ollama/ollama/blob/main/llm/generate/gen_darwin.sh#L71)
>
> and we'd need some adjustments to the GPU discovery logic to be able to identify when to use this variant. At present it's simplistic and just toggles CPU variants on x86, and always uses metal on arm but I think we'd need to actually discover on x86 if there is a metal GPU present. https://github.com/ollama/ollama/blob/main/gpu/gpu_darwin.go and https://github.com/ollama/ollama/blob/main/gpu/gpu_info_darwin.m

The current `go generate ./...` generates a binary with Metal support:

(base) ➜  ollama git:(main) ✗ ./llm/build/darwin/x86_64/metal/bin/ollama_llama_server -m /Users/rf/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
{"build":2737,"commit":"46e12c4","function":"main","level":"INFO","line":2820,"msg":"build info","tid":"0x7ff84a1fa100","timestamp":1714207662}
{"function":"main","level":"INFO","line":2827,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LAMMAFILE = 1 | ","tid":"0x7ff84a1fa100","timestamp":1714207662,"total_threads":16}
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/rf/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.30 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  3584.00 MiB, offs =            0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   982.98 MiB, offs =   3327152128, ( 4566.98 / 16368.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      Metal buffer size =  4155.99 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon HD GFX10 Family Unknown Prototype
ggml_metal_init: picking default device: AMD Radeon HD GFX10 Family Unknown Prototype
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   AMD Radeon HD GFX10 Family Unknown Prototype
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  = 17163.09 MB
ggml_metal_init: skipping kernel_mul_mm_f32_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32      (not supported)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    64.00 MiB, ( 4664.59 / 16368.00)
llama_kv_cache_init:      Metal KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   258.50 MiB, ( 4923.09 / 16368.00)
llama_new_context_with_model:      Metal compute buffer size =   258.50 MiB
llama_new_context_with_model:        CPU compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"0x7ff84a1fa100","timestamp":1714207663}
{"function":"initialize","level":"INFO","line":460,"msg":"new slot","n_ctx_slot":512,"slot_id":0,"tid":"0x7ff84a1fa100","timestamp":1714207663}
{"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"0x7ff84a1fa100","timestamp":1714207663}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"15","port":"8080","tid":"0x7ff84a1fa100","timestamp":1714207663}
{"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"0x7ff84a1fa100","timestamp":1714207663}

Why does `go build .` create an `ollama` binary that does not use the GPU?
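
That matches the discovery behaviour described above: on darwin/amd64 the selection logic only toggles between CPU variants, so a cpu_* runner gets launched even though a metal runner was built. A rough sketch of the kind of branch that would be needed follows; `GpuInfo`/`GetGPUInfo` are placeholder names (reusing the hypothetical `HasDiscreteMetalGPU` helper from the earlier sketch), not the real contents of gpu/gpu_darwin.go.

```go
// Illustrative only: runner selection on macOS that prefers the metal
// variant when a discrete Metal GPU is present. GpuInfo and GetGPUInfo are
// placeholder names, not ollama's actual types.
package gpu

import "runtime"

type GpuInfo struct {
	Library string // runner variant to load, e.g. "metal", "cpu_avx2"
}

func GetGPUInfo() GpuInfo {
	if runtime.GOOS != "darwin" {
		return GpuInfo{Library: "cpu"}
	}
	if runtime.GOARCH == "arm64" {
		// Apple silicon: unified memory, Metal is always available.
		return GpuInfo{Library: "metal"}
	}
	// Intel Macs currently always end up on a CPU variant; a detection step
	// like HasDiscreteMetalGPU would allow picking the metal runner instead.
	if HasDiscreteMetalGPU() {
		return GpuInfo{Library: "metal"}
	}
	return GpuInfo{Library: "cpu"}
}
```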

@xakrume commented on GitHub (Apr 27, 2024):

Running with the `serve` argument executes a binary without Metal support:

(base) ➜  ollama git:(main) ✗ ./ollama serve                                                              
time=2024-04-27T12:30:10.706+03:00 level=INFO source=images.go:821 msg="total blobs: 14"
time=2024-04-27T12:30:10.706+03:00 level=INFO source=images.go:828 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-04-27T12:30:10.706+03:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-04-27T12:30:10.707+03:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners
time=2024-04-27T12:30:10.753+03:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal cpu cpu_avx cpu_avx2 gpu_metal]"
time=2024-04-27T12:30:10.753+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[GIN] 2024/04/27 - 12:31:06 | 200 |      63.277µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/04/27 - 12:31:06 | 200 |     554.973µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/04/27 - 12:31:06 | 200 |     250.134µs |       127.0.0.1 | POST     "/api/show"
time=2024-04-27T12:31:06.449+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-27T12:31:07.500+03:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-27T12:31:07.501+03:00 level=INFO source=server.go:290 msg="starting llama server" cmd="/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/cpu_avx2/ollama_llama_server --model /Users/rf/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 57619"
time=2024-04-27T12:31:07.503+03:00 level=INFO source=sched.go:327 msg="loaded runners" count=1
time=2024-04-27T12:31:07.503+03:00 level=INFO source=server.go:439 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"0x7ff84a1fa100","timestamp":1714210268}
{"build":2737,"commit":"46e12c4","function":"main","level":"INFO","line":2820,"msg":"build info","tid":"0x7ff84a1fa100","timestamp":1714210268}
{"function":"main","level":"INFO","line":2827,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LAMMAFILE = 1 | ","tid":"0x7ff84a1fa100","timestamp":1714210268,"total_threads":16}
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/rf/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))

cmd="/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/CPU_AVX2 /ollama_llama_server

(base) ➜  ollama git:(main) ✗ find /var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners 
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/cpu_avx
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/cpu_avx/ollama_llama_server
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/gpu_metal
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/gpu_metal/ollama_llama_server
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/metal
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/metal/ggml-metal.metal
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/metal/ggml-common.h
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/metal/ollama_llama_server
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/cpu
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/cpu/ollama_llama_server
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/cpu_avx2
/var/folders/hq/t9yhn0xj7bs0bgczp4n908w80000gn/T/ollama3229753367/runners/cpu_avx2/ollama_llama_server

The Metal binary exists but is not used.

Author
Owner

@xakrume commented on GitHub (Apr 27, 2024):

For comparison, here is initialization on an M1 Pro:

➜ ollama git:(main) OLLAMA_DEBUG=1 ./ollama serve
time=2024-04-27T14:01:11.188+03:00 level=INFO source=images.go:821 msg="total blobs: 0"
time=2024-04-27T14:01:11.189+03:00 level=INFO source=images.go:828 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-04-27T14:01:11.189+03:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-04-27T14:01:11.189+03:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/lv/nfkhx1zj6bjb3g4pc0pdb9200000gn/T/ollama2248323029/runners
time=2024-04-27T14:01:11.190+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ggml-common.h.gz
time=2024-04-27T14:01:11.190+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ggml-metal.metal.gz
time=2024-04-27T14:01:11.190+03:00 level=DEBUG source=payload.go:180 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ollama_llama_server.gz
time=2024-04-27T14:01:11.215+03:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/lv/nfkhx1zj6bjb3g4pc0pdb9200000gn/T/ollama2248323029/runners/metal
time=2024-04-27T14:01:11.215+03:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]"
time=2024-04-27T14:01:11.215+03:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-04-27T14:01:11.215+03:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-04-27T14:01:28.764+03:00 level=DEBUG source=sched.go:119 msg="shutting down scheduler pending loop"
time=2024-04-27T14:01:28.764+03:00 level=DEBUG source=assets.go:140 msg="cleaning up" dir=/var/folders/lv/nfkhx1zj6bjb3g4pc0pdb9200000gn/T/ollama2248323029
time=2024-04-27T14:01:28.764+03:00 level=DEBUG source=sched.go:217 msg="shutting down scheduler completed loop"
Author
Owner

@xakrume commented on GitHub (Apr 27, 2024):

I changed `-DCMAKE_OSX_DEPLOYMENT_TARGET` to 13.3 because there was a warning during the build:

/Volumes/256-A/test/ollama/llm/llama.cpp/ggml.c:10800:17: warning: 'cblas_sgemm' is only available on macOS 13.3 or newer [-Wunguarded-availability-new]
                cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ^~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas_new.h:891:6: note: 'cblas_sgemm' has been marked as being introduced in macOS 13.3 here, but the deployment target is macOS 11.3.0
void cblas_sgemm(const enum CBLAS_ORDER ORDER,
     ^
/Volumes/256-A/test/ollama/llm/llama.cpp/ggml.c:10800:17: note: enclose 'cblas_sgemm' in a __builtin_available check to silence this warning
                cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ^~~~~~~~~~~
/Volumes/256-A/test/ollama/llm/llama.cpp/ggml.c:11266:9: warning: 'cblas_sgemm' is only available on macOS 13.3 or newer [-Wunguarded-availability-new]
        cblas_sgemm(CblasRowMajor, transposeA, CblasNoTrans, m, n, k, 1.0, a, lda, b, n, 0.0, c, n);
        ^~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas_new.h:891:6: note: 'cblas_sgemm' has been marked as being introduced in macOS 13.3 here, but the deployment target is macOS 11.3.0
void cblas_sgemm(const enum CBLAS_ORDER ORDER,
     ^
/Volumes/256-A/test/ollama/llm/llama.cpp/ggml.c:11266:9: note: enclose 'cblas_sgemm' in a __builtin_available check to silence this warning
        cblas_sgemm(CblasRowMajor, transposeA, CblasNoTrans, m, n, k, 1.0, a, lda, b, n, 0.0, c, n);
        ^~~~~~~~~~~

I've decided it doesn't make sense to disable Metal support for macOS if the processor supports AVX2.

diff --git a/llm/generate/gen_darwin.sh b/llm/generate/gen_darwin.sh
index f79534c..6f5124e 100755
--- a/llm/generate/gen_darwin.sh
+++ b/llm/generate/gen_darwin.sh
@@ -18,50 +18,26 @@ sign() {
     fi
 }
 
-COMMON_DARWIN_DEFS="-DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DLLAMA_METAL_EMBED_LIBRARY=on"
+COMMON_DARWIN_DEFS="-DCMAKE_OSX_DEPLOYMENT_TARGET=13.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DLLAMA_METAL_EMBED_LIBRARY=on"
 
 case "${GOARCH}" in
 "amd64")
-    COMMON_CPU_DEFS="${COMMON_DARWIN_DEFS} -DCMAKE_SYSTEM_PROCESSOR=${ARCH} -DCMAKE_OSX_ARCHITECTURES=${ARCH} -DLLAMA_METAL=off -DLLAMA_NATIVE=off"
+    COMMON_CPU_DEFS="${COMMON_DARWIN_DEFS} -DCMAKE_SYSTEM_PROCESSOR=${ARCH} -DCMAKE_OSX_ARCHITECTURES=${ARCH} -DLLAMA_METAL=on -DLLAMA_NATIVE=on"
 
     # Static build for linking into the Go binary
     init_vars
     CMAKE_TARGETS="--target llama --target ggml"
-    CMAKE_DEFS="${COMMON_CPU_DEFS} -DBUILD_SHARED_LIBS=off -DLLAMA_ACCELERATE=off -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off ${CMAKE_DEFS}"
+    CMAKE_DEFS="${COMMON_CPU_DEFS} -DBUILD_SHARED_LIBS=off -DLLAMA_ACCELERATE=off -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_NATIVE=on ${CMAKE_DEFS}"
     BUILD_DIR="../build/darwin/${ARCH}_static"
     echo "Building static library"
     build
 
-
-    #
-    # CPU first for the default library, set up as lowest common denominator for maximum compatibility (including Rosetta)
-    #
-    init_vars
-    CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=off -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off ${CMAKE_DEFS}"
-    BUILD_DIR="../build/darwin/${ARCH}/cpu"
-    echo "Building LCD CPU"
-    build
-    sign ${BUILD_DIR}/bin/ollama_llama_server
-    compress
-
-    #
-    # ~2011 CPU Dynamic library with more capabilities turned on to optimize performance
-    # Approximately 400% faster than LCD on same CPU
-    #
-    init_vars
-    CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off ${CMAKE_DEFS}"
-    BUILD_DIR="../build/darwin/${ARCH}/cpu_avx"
-    echo "Building AVX CPU"
-    build
-    sign ${BUILD_DIR}/bin/ollama_llama_server
-    compress
-
     #
     # ~2013 CPU Dynamic library
     # Approximately 10% faster than AVX on same CPU
     #
     init_vars
-    CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=on -DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_AVX512=off -DLLAMA_FMA=on -DLLAMA_F16C=on ${CMAKE_DEFS}"
+    CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=on -DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_AVX512=off -DLLAMA_FMA=on -DLLAMA_F16C=on -DLLAMA_METAL_EMBED_LIBRARY=on -DLLAMA_NATIVE=on ${CMAKE_DEFS}"
     BUILD_DIR="../build/darwin/${ARCH}/cpu_avx2"
     echo "Building AVX2 CPU"
     EXTRA_LIBS="${EXTRA_LIBS} -framework Accelerate -framework Foundation"

I removed the builds for processors without AVX instructions and kept the static-library build.

Adding a check for Metal support to select the appropriate binary would be beneficial. Currently, the runner is launched only for the CPU without considering Metal GPU support.

These changes allowed me to use my GPU for compute.
My build now utilizes the AMD Radeon GPU.
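
With the script patched, rebuilding follows ollama's usual generate-then-build flow. A minimal sketch, assuming the standard development workflow from the repo's docs:

```sh
# Sketch only: regenerate the llama.cpp runners with the patched
# gen_darwin.sh, then rebuild the ollama binary and start the server.
go generate ./...
go build .
OLLAMA_DEBUG=1 ./ollama serve
```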

Author
Owner

@xakrume commented on GitHub (Apr 27, 2024):

ggml_metal_init: using embedded metal library

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/rf/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-04-27T15:27:29.609+03:00 level=DEBUG source=server.go:473 msg="server not yet available" error="server not responding"
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.30 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  3584.00 MiB, offs =            0
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =   982.98 MiB, offs =   3327152128, ( 4566.98 / 16368.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      Metal buffer size =  4155.99 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon HD GFX10 Family Unknown Prototype
ggml_metal_init: picking default device: AMD Radeon HD GFX10 Family Unknown Prototype
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   AMD Radeon HD GFX10 Family Unknown Prototype
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  = 17163.09 MB
ggml_metal_init: skipping kernel_mul_mm_f32_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32            (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32           (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32          (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32         (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32        (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32     (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32       (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32      (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32      (not supported)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   256.00 MiB, ( 4856.59 / 16368.00)
llama_kv_cache_init:      Metal KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   258.50 MiB, ( 5115.09 / 16368.00)
llama_new_context_with_model:      Metal compute buffer size =   258.50 MiB
llama_new_context_with_model:        CPU compute buffer size =    12.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
Author
Owner

@herobs commented on GitHub (Apr 27, 2024):

@xakrume I've compiled successfully with metal enabled.

But it seems the GPU is slower than the CPU (about 2x). My setup is Intel i5-12400 + AMD 6600xt.

Author
Owner

@night0wl0 commented on GitHub (Apr 28, 2024):

@herobs, @xakrume,

When you have a moment, would you be able to post your complete build steps starting with the clone of the fresh Ollama repo?

Author
Owner

@herobs commented on GitHub (Apr 28, 2024):

@night0wl0 You should clone the backend, [llama.cpp](https://github.com/ggerganov/llama.cpp), instead. Then compile it with whatever method you prefer (even just `make`). That gives you a GPU-enabled backend binary, which you can invoke with a model path.
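
A rough sketch of that flow on an Intel Mac, reusing the blob path from the logs above (not a verified recipe; adjust to your checkout):

```sh
# Sketch of the suggestion above: build llama.cpp itself (Metal is normally
# enabled by default in macOS builds of this era; pass LLAMA_METAL=1 to make
# if your checkout needs it) and run it directly against one of ollama's
# downloaded GGUF blobs, offloading all layers to the GPU.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

./main \
  -m ~/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 \
  -ngl 99 -p "Hello"
```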

Author
Owner

@xakrume commented on GitHub (Apr 29, 2024):

@herobs

My setup:
Intel Core i9 9900K
AMD RX 6950 XT 16GB
macOS: 14.4.1 (23E224) Sonoma

My performance with the AMD GPU:

(base) ➜  llama.cpp git:(master) ./build/bin/llama-bench --model ~/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B Q4_0                  |   4.33 GiB |     8.03 B | Metal      |  99 | pp 512     | 66349.61 ± 40501.99 |
| llama 7B Q4_0                  |   4.33 GiB |     8.03 B | Metal      |  99 | tg 128     |      2.78 ± 0.00 |

llama.cpp, compiled from sources.

CPU benchmark from Ubuntu 24.04 on this workstation (LiveUSB)

xakrume@ubuntu:~/llama.cpp/build$ ./bin/llama-bench -m /usr/share/ollama/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 -ngl 0
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |          8 | pp 512     |     30.26 ± 1.70 |
| llama 8B Q4_0                  |   4.33 GiB |     8.03 B | CPU        |          8 | tg 128     |      7.69 ± 0.22 |

build: e00b4a8f (2755)

I found some existing issues in llama.cpp about low GPU performance: https://github.com/ggerganov/llama.cpp/issues/3422

Author
Owner

@dev-zero commented on GitHub (Apr 30, 2024):

> @birchcode that sounds like a good step. What sort of performance are you able to achieve, and does it look promising?

> Using the Metal API on Intel Mac for these other GPUs may complicate our memory detection and layer calculations. Somehow we'd need to refine https://github.com/ollama/ollama/blob/main/gpu/gpu_darwin.go to retrieve the GPU memory from some metal API, and then use an algo similar to the [cuda/rocm version](https://github.com/ollama/ollama/blob/main/gpu/gpu.go#L244-L259)

@dhiltgen Basically it would require enumerating the GPU devices, [filtering by `.metal3`](https://developer.apple.com/documentation/metal/gpu_devices_and_work_submission/detecting_gpu_features_and_metal_software_versions), and returning the `recommendedMaxWorkingSetSize` per device?
The scheduler would then pick the device with the largest memory?
I guess the CPU should not be added to the list, but would automatically be chosen if the model doesn't fit into memory?

Author
Owner

@xakrume commented on GitHub (Apr 30, 2024):

> @xakrume I've compiled successfully with metal enabled.
>
> But it seems the GPU is slower than the CPU (about 2x). My setup is Intel i5-12400 + AMD 6600xt.

Which quant did you use?

https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix
Performance also depends on the type of compute device (CPU/GPU/Metal/etc.).

For the Metal and AVX2 backends, K-quants are a good choice for improving performance.
Here is how I figured it out:

Ollama's current Llama 3 model is Q4_0:

~/.ollama/models/blobs/sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29 (version GGUF V3 (latest))

llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
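
To compare a K-quant against this Q4_0 file, one option is to re-quantize with llama.cpp's quantize tool and benchmark it as above. A sketch only: it assumes an f16 GGUF of the model as input, and the file names below are placeholders.

```sh
# Sketch only: produce a Q4_K_M variant with llama.cpp's quantize tool
# (named quantize in builds of this era, llama-quantize in later ones),
# then benchmark it on the Metal backend.
./build/bin/quantize Meta-Llama-3-8B-Instruct-f16.gguf \
                     Meta-Llama-3-8B-Instruct-Q4_K_M.gguf Q4_K_M
./build/bin/llama-bench --model Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
```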
Author
Owner

@herobs commented on GitHub (Apr 30, 2024):

@xakrume I'm not familiar with these.

llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
ggml_metal_init: found device: AMD Radeon RX 6600 XT
ggml_metal_init: picking default device: AMD Radeon RX 6600 XT
ggml_metal_init: loading 'llama.cpp/build/bin/default.metallib'
ggml_metal_init: GPU name:   AMD Radeon RX 6600 XT
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  =  8573.16 MB
llama_print_timings:      sample time =       7.08 ms /    36 runs   (    0.20 ms per token,  5087.62 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   17005.49 ms /    36 runs   (  472.37 ms per token,     2.12 tokens

@xakrume commented on GitHub (May 1, 2024):

I'm not a Go developer, unfortunately. But currently, for macOS on Intel with an AMD GPU, there is no way to select any server other than the CPU server, and I need a Metal server. Therefore, in place of the CPU binary, I'm using a server built for Metal. In `llm/payload.go`, there is no check for the presence of a PCIe GPU.
Should we disable the CPU servers and leave only the Metal server on macOS?
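
A hedged sketch of what such a check could look like (illustrative only, not ollama's actual `llm/payload.go` logic; `hasDiscreteMetalGPU` is a hypothetical helper): on an Intel Mac, a PCIe or eGPU Metal device reports `hasUnifiedMemory == false`, which could drive the choice between a Metal runner and a CPU runner rather than disabling one of them outright.

```swift
import Metal

// Hypothetical check, for illustration only: does this Mac have a
// discrete (PCIe) or external (eGPU) Metal device? Such devices report
// hasUnifiedMemory == false, unlike integrated / Apple Silicon GPUs.
func hasDiscreteMetalGPU() -> Bool {
    MTLCopyAllDevices().contains { device in
        !device.hasUnifiedMemory && !device.isLowPower
    }
}

// A payload/server selector could prefer a Metal runner only when a
// discrete GPU is actually present, instead of dropping the CPU builds.
print(hasDiscreteMetalGPU() ? "prefer the metal runner" : "keep the cpu runner")
```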


@xakrume commented on GitHub (May 1, 2024):

There is an option for Metal-only support, without CPU (avx/avx2/avx512) usage:

diff --git a/llm/generate/gen_darwin.sh b/llm/generate/gen_darwin.sh
index f79534c..9cd0875 100755
--- a/llm/generate/gen_darwin.sh
+++ b/llm/generate/gen_darwin.sh
@@ -18,7 +18,7 @@ sign() {
     fi
 }
 
-COMMON_DARWIN_DEFS="-DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DLLAMA_METAL_EMBED_LIBRARY=on"
+COMMON_DARWIN_DEFS="-DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=13.3 -DCMAKE_SYSTEM_NAME=Darwin -DLLAMA_METAL_EMBED_LIBRARY=on"
 
 case "${GOARCH}" in
 "amd64")
@@ -68,6 +68,19 @@ case "${GOARCH}" in
     build
     sign ${BUILD_DIR}/bin/ollama_llama_server
     compress
+
+    #
+    # ~2015 Metal GPU Dynamic library
+    # Approximately 200_000% faster than CPU
+    #
+    init_vars
+    CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=on -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_METAL_EMBED_LIBRARY=off -DLLAMA_NATIVE=on ${CMAKE_DEFS}"
+    BUILD_DIR="../build/darwin/${ARCH}/metal"
+    echo "Building for eGPU with Metal support"
+    EXTRA_LIBS="${EXTRA_LIBS} -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders"
+    build
+    sign ${BUILD_DIR}/bin/ollama_llama_server
+    compress
     ;;
 "arm64")

@l-m-mortal commented on GitHub (May 10, 2024):

(quoting @xakrume's Metal-only gen_darwin.sh patch above)

@xakrume
I followed your steps and it compiled fine.
But I got this error when trying to run llama3:
"Error: llama runner process has terminated: signal: abort trap error:unsupported op 'RMS_NORM"

I have an R9 M370X and an RX 570 as an eGPU (both support Metal 2).
Is Metal 2 even relevant here?


@night0wl0 commented on GitHub (May 17, 2024):

(quoting @xakrume's Metal-only gen_darwin.sh patch and @l-m-mortal's "unsupported op 'RMS_NORM'" report above)

I am also encountering the same behavior. After some quick initial research, it seems that RMS_NORM and, in my case, also MUL_MAT are unsupported operations in GGML's Metal backend. My GPU is a Radeon Pro 560.


@xakrume commented on GitHub (May 17, 2024):

(quoting @xakrume's Metal-only gen_darwin.sh patch and @l-m-mortal's "unsupported op 'RMS_NORM'" report above)

I deleted all the CPU builds, so only the Metal build remains:

#!/bin/bash
# This script is intended to run inside the go generate
# working directory must be ./llm/generate/

# TODO - add hardening to detect missing tools (cmake, etc.)

set -ex
set -o pipefail
echo "Starting darwin generate script"
source $(dirname $0)/gen_common.sh
init_vars
git_module_setup
apply_patches

sign() {
    if [ -n "$APPLE_IDENTITY" ]; then
        codesign -f --timestamp --deep --options=runtime --sign "$APPLE_IDENTITY" --identifier ai.ollama.ollama $1
    fi
}

COMMON_DARWIN_DEFS="-DCMAKE_OSX_DEPLOYMENT_TARGET=13.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DLLAMA_METAL_EMBED_LIBRARY=off"

COMMON_CPU_DEFS="${COMMON_DARWIN_DEFS} -DCMAKE_SYSTEM_PROCESSOR=${ARCH} -DCMAKE_OSX_ARCHITECTURES=${ARCH} -DLLAMA_METAL=on -DLLAMA_NATIVE=off"

# Static build for linking into the Go binary
init_vars
CMAKE_TARGETS="--target llama --target ggml"
CMAKE_DEFS="${COMMON_CPU_DEFS} -DBUILD_SHARED_LIBS=off -DLLAMA_ACCELERATE=off -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off ${CMAKE_DEFS}"
BUILD_DIR="../build/darwin/${ARCH}_static"
echo "Building static library"
build

#
# ~2020 GPU Dynamic library
# Approximately 600% faster than CPU
#
init_vars
CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=on -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_METAL_EMBED_LIBRARY=off -DLLAMA_NATIVE=on ${CMAKE_DEFS}"
BUILD_DIR="../build/darwin/${ARCH}/metal"
echo "Building for eGPU with Metal support"
EXTRA_LIBS="${EXTRA_LIBS} -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders"
build
sign ${BUILD_DIR}/bin/ollama_llama_server
compress

cleanup
echo "go generate completed.  LLM runners: $(cd ${BUILD_DIR}/..; echo *)"

@night0wl0 commented on GitHub (May 17, 2024):

@xakrume, thanks. While I was able to build successfully using the script, I still get the same RMS_NORM unsupported-op error as @l-m-mortal.


@l-m-mortal commented on GitHub (May 26, 2024):

@xakrume
Thanks for your reply.
I built it as you showed (GPU only), and got this:

Error: [0] server cpu not listed in available servers map[metal:/var/folders/t4/b5z_lr0d1j55vnjw32vmzzmh0000gn/T/ollama3650114820/runners/metal]

Ruslan? So we can proceed in another language, then? :)


@tristan-k commented on GitHub (May 27, 2024):

@xakrume, thanks, while I was able to successfully build using the script, I still get the same RMS_NORM op unsupported error like @l-m-mortal.

+1

Running an RX 6600 XT with an i5-10600K.

$ ./main -m /Users/admin/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa -n 128
llama_kv_cache_init:      Metal KV buffer size =    64,00 MiB
llama_new_context_with_model: KV self size  =   64,00 MiB, K (f16):   32,00 MiB, V (f16):   32,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,49 MiB
ggml_gallocr_reserve_n: reallocating Metal buffer from size 0,00 MiB to 258,50 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0,00 MiB to 9,01 MiB
llama_new_context_with_model:      Metal compute buffer size =   258,50 MiB
llama_new_context_with_model:        CPU compute buffer size =     9,01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
ggml_metal_graph_compute_block_invoke: error: unsupported op 'RMS_NORM'
GGML_ASSERT: ggml-metal.m:918: !"unsupported op"
[1]    5137 abort      ./main -m  -n 128

@Ehco1996 commented on GitHub (Jun 1, 2024):

I also hit this error:

 ❯ ./main run llama3
Error: [0] server cpu not listed in available servers map[metal:/var/folders/3t/sms3qfxs4kb1mw1jwg30z3h80000gn/T/ollama1854226978/runners/metal]

After digging into llm/server.go, I was finally able to make ollama run on the Metal server with a hard-coded workaround:

diff --git a/llm/generate/gen_darwin.sh b/llm/generate/gen_darwin.sh
index f79534cd..8841872a 100755
--- a/llm/generate/gen_darwin.sh
+++ b/llm/generate/gen_darwin.sh
@@ -32,36 +32,13 @@ case "${GOARCH}" in
     echo "Building static library"
     build

-
-    #
-    # CPU first for the default library, set up as lowest common denominator for maximum compatibility (including Rosetta)
-    #
-    init_vars
-    CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=off -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off ${CMAKE_DEFS}"
-    BUILD_DIR="../build/darwin/${ARCH}/cpu"
-    echo "Building LCD CPU"
-    build
-    sign ${BUILD_DIR}/bin/ollama_llama_server
-    compress
-
-    #
-    # ~2011 CPU Dynamic library with more capabilities turned on to optimize performance
-    # Approximately 400% faster than LCD on same CPU
-    #
-    init_vars
-    CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off ${CMAKE_DEFS}"
-    BUILD_DIR="../build/darwin/${ARCH}/cpu_avx"
-    echo "Building AVX CPU"
-    build
-    sign ${BUILD_DIR}/bin/ollama_llama_server
-    compress
-
     #
     # ~2013 CPU Dynamic library
     # Approximately 10% faster than AVX on same CPU
     #
     init_vars
-    CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=on -DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_AVX512=off -DLLAMA_FMA=on -DLLAMA_F16C=on ${CMAKE_DEFS}"
+    # CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=on -DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_AVX512=off -DLLAMA_FMA=on -DLLAMA_F16C=on ${CMAKE_DEFS}"
+    CMAKE_DEFS="${COMMON_CPU_DEFS} -DLLAMA_ACCELERATE=on -DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_AVX512=off -DLLAMA_FMA=on -DLLAMA_F16C=on -DLLAMA_METAL_EMBED_LIBRARY=on -DLLAMA_NATIVE=on ${CMAKE_DEFS}"
     BUILD_DIR="../build/darwin/${ARCH}/cpu_avx2"
     echo "Building AVX2 CPU"
     EXTRA_LIBS="${EXTRA_LIBS} -framework Accelerate -framework Foundation"
diff --git a/llm/server.go b/llm/server.go
index 9b5d0f06..aaeaf117 100644
--- a/llm/server.go
+++ b/llm/server.go
@@ -230,6 +230,7 @@ func NewLlamaServer(gpus gpu.GpuInfoList, model string, ggml *GGML, adapters, pr
 	}

 	params = append(params, "--parallel", fmt.Sprintf("%d", numParallel))
+	servers = []string{"metal"}

 	for i := 0; i < len(servers); i++ {
 		dir := availableServers[servers[i]]

But I found that using the GPU is much slower than the CPU; see the screenshot below.

(Screenshot 2024-06-01 at 10:11:17, attached to the original GitHub comment)

@guidocioni commented on GitHub (Jun 7, 2024):

Is this ever going to be supported?
I'm running macOS with an i9-13900F and an RX 6800 XT, but right now, having installed Ollama directly from the downloaded pkg, it only uses the CPU cores.


@cmarhoover commented on GitHub (Jun 15, 2024):

Does llama.cpp commit f8ec887 address this problem in some way? It seems that precompiled builds of llama.cpp after April 2 were impacted. Issue #7940


@tristan-k commented on GitHub (Jun 18, 2024):

Does llama.cpp commit f8ec887 address this problem in some way? Seems that precompiled builds of llama.cpp after April 2 were impacted. Issue #7940

Indeed, the latest llama.cpp (b3173) does use the GPU on my macOS Sonoma installation.

Is there any way to swap the latest llama.cpp binaries into ollama? I want to use Open WebUI, which depends on ollama - or is there a time frame for when the changes will arrive in ollama?

llama-cli \
        --hf-repo "TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF" \
        --hf-file ggml-model-q4_0.gguf \
        -p "I believe the meaning of life is" \
        -n 128
Log start
main: build = 3173 (a94e6ff8)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for x86_64-apple-darwin23.4.0
main: seed  = 1718709174
llama_download_file: previous metadata file found /Users/admin/Library/Caches/llama.cpp/ggml-model-q4_0.gguf.json: {"etag":"\"4b084aeae725de00362289c272049bed-40\"","lastModified":"Wed, 27 Sep 2023 13:52:41 GMT","url":"https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"}
llama_model_loader: loaded meta data with 20 key-value pairs and 201 tensors from /Users/admin/Library/Caches/llama.cpp/ggml-model-q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = ..
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0,000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000,000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32003]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32003]   = [0,000000, 0,000000, 0,000000, 0,0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32003]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type q4_0:  155 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 262
llm_load_vocab: token to piece cache size = 0,1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32003
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-05
llm_load_print_meta: f_clamp_kqv      = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale    = 0,0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1,10 B
llm_load_print_meta: model size       = 606,54 MiB (4,63 BPW)
llm_load_print_meta: general.name     = ..
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32002 '<|im_end|>'
llm_load_tensors: ggml ctx size =    0,20 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =   571,38 MiB, (  571,38 /  8176,00)
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:        CPU buffer size =    35,16 MiB
llm_load_tensors:      Metal buffer size =   571,38 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon PRO W6600
ggml_metal_init: picking default device: AMD Radeon PRO W6600
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   AMD Radeon PRO W6600
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  =  8573,16 MB
ggml_metal_init: skipping kernel_mul_mm_f32_f32                    (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32                    (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32             (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32              (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32             (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32              (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32              (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h64            (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h80            (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h96            (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h112           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h128           (not supported)
llama_kv_cache_init:      Metal KV buffer size =    44,00 MiB
llama_new_context_with_model: KV self size  =   44,00 MiB, K (f16):   22,00 MiB, V (f16):   22,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,12 MiB
llama_new_context_with_model:      Metal compute buffer size =   148,00 MiB
llama_new_context_with_model:        CPU compute buffer size =     8,01 MiB
llama_new_context_with_model: graph nodes  = 710
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 128, n_keep = 1


 I believe the meaning of life is
Trevor Noah, host of The Daily Show with Donald Trump as the character "Anthony DiNovo"

Trevor is a sharp and witty political satirist, known for his wit, style, and intelligence. He has been described as having a "masterful sense of timing," and "a mastery of language" that he uses to "satirize the news, politics, celebrities, sports, and anything else that comes his way." He is known for his ability to "take a joke and make it so many times that it becomes part of the conversation
llama_print_timings:        load time =    1431,87 ms
llama_print_timings:      sample time =       8,19 ms /   128 runs   (    0,06 ms per token, 15632,63 tokens per second)
llama_print_timings: prompt eval time =     603,70 ms /     8 tokens (   75,46 ms per token,    13,25 tokens per second)
llama_print_timings:        eval time =   12352,42 ms /   127 runs   (   97,26 ms per token,    10,28 tokens per second)
llama_print_timings:       total time =   12988,65 ms /   135 tokens
ggml_metal_free: deallocating
Log end

@dbl001 commented on GitHub (Jun 18, 2024):

GPT2 and Llama2 appear to be working.
Llama3 crashes: GGML_ASSERT: ggml-metal.m:1769: false && "not implemented"

I used these build parameters:

% OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_F16C=on -DLLAMA_FMA=on -DLLAMA_METAL=on -DLLAMA_METAL_EMBED_LIBRARY=on -DGGML_USE_METAL=on -DLLAMA_METAL_COMPILE_SERIALIZED=1" go generate -v ./...

% CGO_CFLAGS="-I/opt/local/include" CGO_LDFLAGS="-L/opt/local/lib -framework Accelerate" go build .

Here's an attempt to run llama3 bf16 on an iMac 27" with an AMD Radeon Pro 5700 XT:

% ./main -m /Users/davidlaxer/llama3/Meta-Llama-3-8B/ggml-model-bf16.gguf  -n 128 -ngl 1 -i
Log start
main: build = 3051 (5921b8f0)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for x86_64-apple-darwin23.5.0
main: seed  = 1718739016
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/davidlaxer/llama3/Meta-Llama-3-8B/ggml-model-bf16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama3
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 32
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,128256]  = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type bf16:  226 tensors
llm_load_vocab: special tokens cache size = 96515
llm_load_vocab: token to piece cache size = 0.8876 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = BF16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = llama3
llm_load_print_meta: BOS token        = 128000 '[PAD128000]'
llm_load_print_meta: EOS token        = 128001 '[PAD128001]'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.30 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =  1418.04 MiB, ( 1418.04 / 16368.00)
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/33 layers to GPU
llm_load_tensors:      Metal buffer size =  1418.03 MiB
llm_load_tensors:        CPU buffer size = 15317.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon Pro 5700 XT
ggml_metal_init: picking default device: AMD Radeon Pro 5700 XT
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/davidlaxer/ollama/llm/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   AMD Radeon Pro 5700 XT
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  = 17163.09 MB
ggml_metal_init: skipping kernel_mul_mm_f32_f32                    (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32                    (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32             (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32              (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32             (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32              (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32              (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h64            (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h80            (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h96            (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h112           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h128           (not supported)
llama_kv_cache_init:      Metal KV buffer size =     2.00 MiB
llama_kv_cache_init:        CPU KV buffer size =    62.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      Metal compute buffer size =    81.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 3
Asserting on type 30
GGML_ASSERT: ggml-metal.m:1769: false && "not implemented"
Asserting on type 30
GGML_ASSERT: ggml-metal.m:1769: false && "not implemented"
zsh: abort      ./main -m /Users/davidlaxer/llama3/Meta-Llama-3-8B/ggml-model-bf16.gguf -n 12

The quantized version of llama3 (Llama-3-8B/ggml-model-bf16-Q4_K_M.gguf, run with -n 128 -ngl 1 -i) outputs gibberish:

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

▅ permet In Mayor? Список?OFandidnice Inief▅

▅poISBN nature	
                D?
                  at
                    D?
                      at
                        D?
                          at
                            D?


Llama2 7B:

% ./main -m /Users/davidlaxer/llama.cpp/models/7B/ggml-model-q4_0.gguf -n 128 -ngl 1 -i       
...
↵v 12 See, for example, L. J. van der Velde and W. J. Baars, “The Social Brain: From Monkey Brain to Human Brain,” Social Neuroscience, vol. 1, pp. 1–11, 2006.
↵w R. J. Davidson et al., “Affective Neuroscience: The Cognitive Brain,” Oxford University Press, New York, 1992, pp. 68–71.
↵x J. A. Paulus,
 M. F. D’Esposito, and P. H. G. Lashley, “Neural Substrates of Memory and Attention: A Meta-Analytic Review,” Journal of Cognitive Neuroscience ...

Quantized GPT2:

./main -m /Users/davidlaxer/llama2.cpp/models/gpt2/ggml-model-bf16-Q4_K_M.gguf -n 128 -ngl 1 -i
...
Comments on Covid-19 Receptor Binding Sites?

Covid-19 Receptor Binding Sites

The three main sites on which recombinant CRM-1 (CRM-1a) and recombinant CRM-2 (CRM-2a) are found in the brain and the brainstem, and the two sites of the same protein and the two sites of the same cell line are the the same. The most recent (and the most recent and most recent) work in the last few years in the brain and brainstem, and in the brainstem, and in the brainstem, and in
 the brain and the brain and the brain and the brain and the brain and the brain and the brain, and in the brain, and in the brain, and in the brain, and the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain and in the brain and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain,
 and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the brain, and in the

Author
Owner

@dbl001 commented on GitHub (Jun 19, 2024):

'ollama serve' shows that only the CPU is used when computing llama3 and mistral embeddings (see below). Is there a way to build ollama on the Mac (i.e., darwin) so it uses the AMD GPU, which llama.cpp's 'main' can use (see comment above)?

 % ollama serve
2024/06/18 19:03:59 routes.go:1011: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-18T19:03:59.514-07:00 level=INFO source=images.go:725 msg="total blobs: 28"
time=2024-06-18T19:03:59.517-07:00 level=INFO source=images.go:732 msg="total unused blobs removed: 0"
time=2024-06-18T19:03:59.518-07:00 level=INFO source=routes.go:1057 msg="Listening on 127.0.0.1:11434 (version 0.1.44)"
time=2024-06-18T19:03:59.519-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1934560689/runners
time=2024-06-18T19:03:59.544-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-06-18T19:03:59.544-07:00 level=INFO source=types.go:71 msg="inference compute" id="" library=cpu compute="" driver=0.0 name="" total="128.0 GiB" available="0 B"
time=2024-06-18T19:05:08.336-07:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=0 memory.available="0 B" memory.required.full="4.6 GiB" memory.required.partial="794.5 MiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-06-18T19:05:08.341-07:00 level=INFO source=server.go:341 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1934560689/runners/cpu_avx2/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 61763"
time=2024-06-18T19:05:08.349-07:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-18T19:05:08.349-07:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-18T19:05:08.350-07:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3051 commit="5921b8f0" tid="0x7ff85e144fc0" timestamp=1718762708
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x7ff85e144fc0" timestamp=1718762708 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61763" tid="0x7ff85e144fc0" timestamp=1718762708
time=2024-06-18T19:05:08.853-07:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 1.5928 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  4437.80 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
INFO [main] model loaded | tid="0x7ff85e144fc0" timestamp=1718762712
time=2024-06-18T19:05:13.112-07:00 level=INFO source=server.go:572 msg="llama runner started in 4.76 seconds"
[GIN] 2024/06/18 - 19:05:13 | 200 |  6.277269591s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:05:17 | 200 |  1.343368538s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:05:56 | 200 | 38.264996115s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:06:14 | 200 |  18.67045649s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:06:17 | 200 |  1.513778261s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:06:56 | 200 | 39.255817896s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:07:14 | 200 | 18.166336108s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:07:16 | 200 |   1.37936724s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:07:55 | 200 | 38.925646118s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:08:14 | 200 | 18.518717157s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:08:16 | 200 |  1.464268007s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:08:56 | 200 | 40.078046141s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:09:15 | 200 | 19.080398256s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:09:23 | 200 |  1.224606132s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:10:04 | 200 | 40.776465321s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:10:09 | 200 |  5.814193246s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:10:12 | 200 |   699.11227ms |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:10:36 | 200 | 24.406945746s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:10:37 | 200 |   797.87609ms |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:10:45 | 200 |  7.457018591s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:10:47 | 200 |  1.278116828s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:11:05 | 200 | 18.156415839s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:11:07 | 200 |  1.385959066s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:11:46 | 200 | 38.410457618s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:12:04 | 200 | 18.345887955s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:12:06 | 200 |  1.010228518s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:12:47 | 200 | 40.333908619s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:12:50 | 200 |  1.281176651s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:13:28 | 200 | 37.757102547s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:13:49 | 200 |  20.93774188s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:13:51 | 200 |  1.153672288s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:14:26 | 200 | 35.123663292s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:14:29 | 200 |  1.072264108s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:14:46 | 200 | 16.726134411s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:14:49 | 200 |  1.492894824s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:15:26 | 200 | 36.819901976s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:15:44 | 200 | 18.748605609s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 19:16:44 | 200 |       35.27µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 19:16:44 | 200 |    8.064097ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/06/18 - 19:16:58 | 200 |      16.143µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 19:17:19 | 200 |      19.184µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 19:17:19 | 200 |    1.407553ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/18 - 19:17:38 | 200 |      18.687µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 19:17:38 | 200 |     470.335µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/18 - 19:17:48 | 200 |      17.298µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 19:17:48 | 200 |     467.322µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/18 - 19:18:02 | 200 |      17.465µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 19:18:02 | 200 |     451.951µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/06/18 - 19:18:38 | 200 |      17.455µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 19:18:38 | 200 |     992.195µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2024/06/18 - 19:19:02 | 200 |      20.008µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 19:19:02 | 200 |    1.112039ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/06/18 - 19:19:46 | 200 |      19.219µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/06/18 - 19:19:47 | 200 |  996.497378ms |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/06/18 - 19:34:14 | 200 |      16.751µs |       127.0.0.1 | HEAD     "/"
time=2024-06-18T19:34:16.930-07:00 level=INFO source=download.go:136 msg="downloading ff82381e2bea in 42 100 MB part(s)"
time=2024-06-18T19:40:43.159-07:00 level=INFO source=download.go:136 msg="downloading 43070e2d4e53 in 1 11 KB part(s)"
time=2024-06-18T19:40:46.456-07:00 level=INFO source=download.go:136 msg="downloading c43332387573 in 1 67 B part(s)"
time=2024-06-18T19:40:48.487-07:00 level=INFO source=download.go:136 msg="downloading ed11eda7790d in 1 30 B part(s)"
time=2024-06-18T19:40:50.496-07:00 level=INFO source=download.go:136 msg="downloading 42347cd80dc8 in 1 485 B part(s)"
[GIN] 2024/06/18 - 19:41:00 | 200 |         6m45s |       127.0.0.1 | POST     "/api/pull"
time=2024-06-18T23:44:03.900-07:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=0 memory.available="0 B" memory.required.full="4.3 GiB" memory.required.partial="302.0 MiB" memory.required.kv="256.0 MiB" memory.weights.total="3.8 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="105.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="185.0 MiB"
time=2024-06-18T23:44:03.902-07:00 level=INFO source=server.go:341 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1934560689/runners/cpu_avx2/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 53287"
time=2024-06-18T23:44:03.912-07:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-18T23:44:03.912-07:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-18T23:44:03.912-07:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3051 commit="5921b8f0" tid="0x7ff85e144fc0" timestamp=1718779443
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x7ff85e144fc0" timestamp=1718779443 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="53287" tid="0x7ff85e144fc0" timestamp=1718779443
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Mistral-7B-Instruct-v0.3
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32768
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32768]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32768]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 1027
llm_load_vocab: token to piece cache size = 0.3368 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32768
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.25 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = Mistral-7B-Instruct-v0.3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 781 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  3922.02 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
time=2024-06-18T23:44:04.165-07:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.14 MiB
llama_new_context_with_model:        CPU compute buffer size =   164.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
INFO [main] model loaded | tid="0x7ff85e144fc0" timestamp=1718779452
time=2024-06-18T23:44:12.182-07:00 level=INFO source=server.go:572 msg="llama runner started in 8.27 seconds"
[GIN] 2024/06/18 - 23:44:13 | 200 |  9.947288057s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:44:57 | 200 | 44.102222619s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:45:19 | 200 | 21.654976187s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:45:21 | 200 |  1.572285132s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:46:04 | 200 | 42.162966208s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:46:25 | 200 | 21.730762097s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:46:28 | 200 |  1.550663412s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:47:10 | 200 |  42.36141677s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:47:32 | 200 | 21.342104959s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:47:34 | 200 |  1.461576373s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:48:16 | 200 | 42.071383927s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:48:38 | 200 | 21.835165107s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:48:46 | 200 |  1.217782813s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/06/18 - 23:49:28 | 200 |  42.35996573s |       127.0.0.1 | POST     "/api/embeddings"
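
Tying this back to the question at the top of this comment: the startup lines "Dynamic LLM libraries [cpu cpu_avx cpu_avx2]" and "inference compute ... library=cpu ... available=0 B" show that the darwin/amd64 runner payload contains no Metal variant, so the server has nothing to offload to regardless of the model. A hedged sketch, reusing the custom defs from the previous comment (not an officially supported path), for rebuilding and then checking whether a Metal-capable runner actually gets packaged:

```
# Rebuild the runner payloads with Metal enabled (same defs as the earlier comment);
# whether the darwin/amd64 payload scripts honour them is exactly the open question here.
OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_METAL=on -DLLAMA_METAL_EMBED_LIBRARY=on -DGGML_USE_METAL=on" \
  go generate -v ./...
CGO_CFLAGS="-I/opt/local/include" CGO_LDFLAGS="-L/opt/local/lib -framework Accelerate" go build .

# On startup, the "Dynamic LLM libraries [...]" line shows which runners were packaged;
# if it still lists only cpu/cpu_avx/cpu_avx2, inference stays on the CPU.
./ollama serve
```

The repeated POST /api/embeddings calls above, each taking roughly 18 to 40 seconds, are consistent with the q4_0 models running entirely on the CPU.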


| 127.0.0.1 | HEAD "/" [GIN] 2024/06/18 - 19:18:38 | 200 | 992.195µs | 127.0.0.1 | GET "/api/ps" [GIN] 2024/06/18 - 19:19:02 | 200 | 20.008µs | 127.0.0.1 | HEAD "/" [GIN] 2024/06/18 - 19:19:02 | 200 | 1.112039ms | 127.0.0.1 | GET "/api/tags" [GIN] 2024/06/18 - 19:19:46 | 200 | 19.219µs | 127.0.0.1 | HEAD "/" [GIN] 2024/06/18 - 19:19:47 | 200 | 996.497378ms | 127.0.0.1 | POST "/api/pull" [GIN] 2024/06/18 - 19:34:14 | 200 | 16.751µs | 127.0.0.1 | HEAD "/" time=2024-06-18T19:34:16.930-07:00 level=INFO source=download.go:136 msg="downloading ff82381e2bea in 42 100 MB part(s)" time=2024-06-18T19:40:43.159-07:00 level=INFO source=download.go:136 msg="downloading 43070e2d4e53 in 1 11 KB part(s)" time=2024-06-18T19:40:46.456-07:00 level=INFO source=download.go:136 msg="downloading c43332387573 in 1 67 B part(s)" time=2024-06-18T19:40:48.487-07:00 level=INFO source=download.go:136 msg="downloading ed11eda7790d in 1 30 B part(s)" time=2024-06-18T19:40:50.496-07:00 level=INFO source=download.go:136 msg="downloading 42347cd80dc8 in 1 485 B part(s)" [GIN] 2024/06/18 - 19:41:00 | 200 | 6m45s | 127.0.0.1 | POST "/api/pull" time=2024-06-18T23:44:03.900-07:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=0 memory.available="0 B" memory.required.full="4.3 GiB" memory.required.partial="302.0 MiB" memory.required.kv="256.0 MiB" memory.weights.total="3.8 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="105.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="185.0 MiB" time=2024-06-18T23:44:03.902-07:00 level=INFO source=server.go:341 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1934560689/runners/cpu_avx2/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 53287" time=2024-06-18T23:44:03.912-07:00 level=INFO source=sched.go:338 msg="loaded runners" count=1 time=2024-06-18T23:44:03.912-07:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding" time=2024-06-18T23:44:03.912-07:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=3051 commit="5921b8f0" tid="0x7ff85e144fc0" timestamp=1718779443 INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x7ff85e144fc0" timestamp=1718779443 total_threads=16 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="53287" tid="0x7ff85e144fc0" timestamp=1718779443 llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Mistral-7B-Instruct-v0.3 llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 32768 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 32768 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = llama llama_model_loader: - kv 14: tokenizer.ggml.pre str = default llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[... llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32768] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32768] = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 23: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 24: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 1027 llm_load_vocab: token to piece cache size = 0.3368 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32768 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 
7.25 B llm_load_print_meta: model size = 3.83 GiB (4.54 BPW) llm_load_print_meta: general.name = Mistral-7B-Instruct-v0.3 llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: LF token = 781 '<0x0A>' llm_load_tensors: ggml ctx size = 0.15 MiB llm_load_tensors: CPU buffer size = 3922.02 MiB llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 time=2024-06-18T23:44:04.165-07:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model" llama_kv_cache_init: CPU KV buffer size = 256.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CPU output buffer size = 0.14 MiB llama_new_context_with_model: CPU compute buffer size = 164.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 1 INFO [main] model loaded | tid="0x7ff85e144fc0" timestamp=1718779452 time=2024-06-18T23:44:12.182-07:00 level=INFO source=server.go:572 msg="llama runner started in 8.27 seconds" [GIN] 2024/06/18 - 23:44:13 | 200 | 9.947288057s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:44:57 | 200 | 44.102222619s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:45:19 | 200 | 21.654976187s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:45:21 | 200 | 1.572285132s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:46:04 | 200 | 42.162966208s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:46:25 | 200 | 21.730762097s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:46:28 | 200 | 1.550663412s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:47:10 | 200 | 42.36141677s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:47:32 | 200 | 21.342104959s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:47:34 | 200 | 1.461576373s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:48:16 | 200 | 42.071383927s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:48:38 | 200 | 21.835165107s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:48:46 | 200 | 1.217782813s | 127.0.0.1 | POST "/api/embeddings" [GIN] 2024/06/18 - 23:49:28 | 200 | 42.35996573s | 127.0.0.1 | POST "/api/embeddings" ```
Author
Owner

@cracksauce commented on GitHub (Jun 19, 2024):

> Is there a way to build ollama on the Mac (e.g. darwin) to utilize the AMD GPU, which runs with llama.cpp's 'main' (see comment above)?

Would also appreciate a solution for this

<!-- gh-comment-id:2178903533 -->
Author
Owner

@dhiltgen commented on GitHub (Jun 19, 2024):

For folks in the community working on this, keep in mind there are two pieces of the puzzle that will need to be implemented to make this work.

The first is figuring out the right flags to pass to cmake so llama.cpp compiles the x86 Metal variant, and wiring that up as a new runner, most likely called "metal", [here](https://github.com/ollama/ollama/blob/main/llm/generate/gen_darwin.sh#L23-L72).

The second is wiring up "GPU discovery" with a VRAM lookup. At startup we discover which GPUs are present, and specifically how much VRAM they have available, so we can schedule model loads that don't exceed the available memory. Modifications to [gpu_darwin.go](https://github.com/ollama/ollama/blob/main/gpu/gpu_darwin.go) and [gpu_info_darwin.m](https://github.com/ollama/ollama/blob/main/gpu/gpu_info_darwin.m) will be needed. We also need to get the current VRAM usage during runtime so the scheduler can support concurrency.
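
To illustrate the kind of data that second piece has to surface, here is a minimal Swift sketch (standalone, not Ollama's actual gpu_info_darwin.m implementation) that enumerates Metal devices on an Intel Mac and reads per-device memory figures; which Metal properties a real port would ultimately query is an assumption here:

```swift
import Metal

// Hedged sketch: list every Metal device (on an Intel Mac this can include an
// AMD dGPU and any eGPU) and report memory figures a scheduler could use.
// recommendedMaxWorkingSetSize ~ how much memory the device can keep resident;
// currentAllocatedSize ~ how much it has allocated right now.
for device in MTLCopyAllDevices() {
    let totalMiB = device.recommendedMaxWorkingSetSize / (1 << 20)
    let usedMiB = device.currentAllocatedSize / (1 << 20)
    print("\(device.name): removable=\(device.isRemovable) lowPower=\(device.isLowPower) recommendedMax=\(totalMiB) MiB allocated=\(usedMiB) MiB")
}
```

Mapping numbers like these into the existing memory.available / memory.required accounting would presumably be the Go-side work in gpu_darwin.go.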

<!-- gh-comment-id:2178982244 -->
Author
Owner

@l-m-mortal commented on GitHub (Jun 25, 2024):

Thanks to @xakrume, my MacBook Pro 15 (2015) with an AMD GPU managed to run `ollama serve`,
but it prioritizes the embedded GPU (AMD Radeon R9 M370X) instead of the eGPU (AMD Radeon RX 570).
![Screen Shot 2024-06-25 at 20 31 59](https://github.com/ollama/ollama/assets/107005667/39e12550-356c-4769-bbbd-6113bc056e2b)
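
For anyone hacking on device selection, Metal does expose enough metadata to tell the two apart; a hypothetical Swift snippet (not Ollama code) that prefers a removable eGPU over the built-in discrete GPU could look like this:

```swift
import Metal

// Hypothetical ordering: removable (eGPU) devices first, then discrete
// (non-low-power) ones, so an RX 570 eGPU would win over an R9 M370X.
let ordered = MTLCopyAllDevices().sorted { a, b in
    if a.isRemovable != b.isRemovable { return a.isRemovable }
    return !a.isLowPower && b.isLowPower
}
if let preferred = ordered.first {
    print("Would offload to: \(preferred.name)")
}
```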

<!-- gh-comment-id:2189549270 -->
Author
Owner

@ahornby commented on GitHub (Jul 13, 2024):

I got it working based on the info above in https://github.com/ahornby/ollama/tree/macos_amd64_metal; however, the Metal backend on my MacBook Pro 15 (2019, 560X) was slower than the CPU, so I'm giving up. Maybe the commit will be useful to someone else with a faster GPU; it has the commands used in the [commit message](https://github.com/ollama/ollama/commit/cb85daa89e28131447fb738958706391a5ad57a3).

<!-- gh-comment-id:2226905130 -->
Author
Owner

@dbl001 commented on GitHub (Jul 13, 2024):

@ahornby I tried cloning your fork and running ollama on my 2022 iMac 27" with an AMD Radeon Pro 5700 XT. It doesn't appear to find the GPU. Do you see anything I missed?

% git clone -b macos_amd64_metal https://github.com/ahornby/ollama.git
Cloning into 'ollama'...
remote: Enumerating objects: 14985, done.
remote: Counting objects: 100% (361/361), done.
remote: Compressing objects: 100% (215/215), done.
remote: Total 14985 (delta 191), reused 257 (delta 146), pack-reused 14624
Receiving objects: 100% (14985/14985), 8.02 MiB | 8.04 MiB/s, done.
Resolving deltas: 100% (9525/9525), done.
(AI-Feynman) davidlaxer@bluediamond ~ % cd ollama 
(AI-Feynman) davidlaxer@bluediamond ollama % env | grep clang
CLANG=/usr/bin/clang
CC=/usr/bin/clang
OBJC=/usr/bin/clang
CC_FOR_BUILD=/usr/bin/clang
OBJC_FOR_BUILD=/usr/bin/clang
CXX=/usr/bin/clang++

(AI-Feynman) davidlaxer@bluediamond ollama % export CGO_CFLAGS="-I/opt/local/include/libomp"
(AI-Feynman) davidlaxer@bluediamond ollama % export CGO_LDFLAGS="-L/opt/local/lib/libomp -lomp -framework Accelerate"
(AI-Feynman) davidlaxer@bluediamond ollama % OLLAMA_SKIP_CPU_GENERATE=on OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_F16C=on -DLLAMA_FMA=on -DLLAMA_METAL=on -DLLAMA_METAL_EMBED_LIBRARY=on -DGGML_USE_METAL=on -DLLAMA_METAL_COMPILE_SERIALIZED=1" go generate -v ./...

main.go
api/client.go
api/client_test.go
api/types.go
api/types_test.go
app/main.go
app/assets/assets.go
app/lifecycle/getstarted_nonwindows.go
app/lifecycle/lifecycle.go
app/lifecycle/logging.go
app/lifecycle/logging_nonwindows.go
app/lifecycle/logging_test.go
app/lifecycle/paths.go
app/lifecycle/server.go
app/lifecycle/server_unix.go
app/lifecycle/updater.go
app/lifecycle/updater_nonwindows.go
app/store/store.go
app/store/store_darwin.go
app/tray/tray.go
app/tray/tray_nonwindows.go
app/tray/commontray/types.go
auth/auth.go
cmd/cmd.go
cmd/interactive.go
cmd/interactive_test.go
cmd/start.go
cmd/start_darwin.go
convert/convert.go
convert/gemma.go
convert/llama.go
convert/mistral.go
convert/mixtral.go
convert/safetensors.go
convert/tokenizer.go
convert/torch.go
convert/sentencepiece/sentencepiece_model.pb.go
envconfig/config.go
envconfig/config_test.go
examples/go-chat/main.go
examples/go-generate/main.go
examples/go-generate-streaming/main.go
examples/go-http-generate/main.go
examples/go-multimodal/main.go
examples/go-pull-progress/main.go
format/bytes.go
format/format.go
format/format_test.go
format/time.go
format/time_test.go
gpu/assets.go
gpu/cpu_common.go
gpu/gpu_darwin.go
gpu/gpu_test.go
gpu/types.go
llm/filetype.go
llm/ggla.go
llm/ggml.go
llm/ggml_test.go
llm/gguf.go
llm/llm.go
llm/llm_darwin_amd64.go
llm/memory.go
llm/memory_test.go
llm/payload.go
llm/server.go
llm/status.go
llm/generate/generate_darwin.go
+ set -o pipefail
+ echo 'Starting darwin generate script'
Starting darwin generate script
++ dirname ./gen_darwin.sh
+ source ./gen_common.sh
+ init_vars
+ case "${GOARCH}" in
+ ARCH=x86_64
+ LLAMACPP_DIR=../llama.cpp
+ CMAKE_DEFS=
+ CMAKE_TARGETS='--target ollama_llama_server'
+ echo -I/opt/local/include/libomp
+ grep -- -g
+ CMAKE_DEFS='-DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off '
+ case $(uname -s) in
++ uname -s
+ LIB_EXT=dylib
+ WHOLE_ARCHIVE=-Wl,-force_load
+ NO_WHOLE_ARCHIVE=
+ GCC_ARCH='-arch x86_64'
+ '[' -z '' ']'
+ CMAKE_CUDA_ARCHITECTURES='50;52;61;70;75;80'
+ git_module_setup
+ '[' -n '' ']'
+ '[' -d ../llama.cpp/gguf ']'
+ git submodule init
Submodule 'llama.cpp' (https://github.com/ggerganov/llama.cpp.git) registered for path '../llama.cpp'
+ git submodule update --force ../llama.cpp
Cloning into '/Users/davidlaxer/ollama/llm/llama.cpp'...
remote: Enumerating objects: 19612, done.
remote: Counting objects: 100% (19611/19611), done.
remote: Compressing objects: 100% (5184/5184), done.
remote: Total 19141 (delta 14341), reused 18567 (delta 13786), pack-reused 0
Receiving objects: 100% (19141/19141), 18.21 MiB | 10.27 MiB/s, done.
Resolving deltas: 100% (14341/14341), completed with 360 local objects.
From https://github.com/ggerganov/llama.cpp
 * branch              a8db2a9ce64cd4417f6a312ab61858f17f0f8584 -> FETCH_HEAD
Submodule path '../llama.cpp': checked out 'a8db2a9ce64cd4417f6a312ab61858f17f0f8584'
+ apply_patches
+ grep ollama ../llama.cpp/CMakeLists.txt
+ echo 'add_subdirectory(../ext_server ext_server) # ollama'
++ ls -A ../patches/01-load-progress.diff ../patches/02-clip-log.diff ../patches/03-load_exception.diff ../patches/04-metal.diff ../patches/05-default-pretokenizer.diff ../patches/06-qwen2.diff ../patches/07-embeddings.diff ../patches/08-clip-unicode.diff ../patches/09-pooling.diff
+ '[' -n '../patches/01-load-progress.diff
../patches/02-clip-log.diff
../patches/03-load_exception.diff
../patches/04-metal.diff
../patches/05-default-pretokenizer.diff
../patches/06-qwen2.diff
../patches/07-embeddings.diff
../patches/08-clip-unicode.diff
../patches/09-pooling.diff' ']'
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/01-load-progress.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout common/common.cpp
Updated 0 paths from the index
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout common/common.h
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/02-clip-log.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout examples/llava/clip.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/03-load_exception.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/04-metal.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout ggml/src/ggml-metal.m
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/05-default-pretokenizer.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/06-qwen2.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/07-embeddings.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/08-clip-unicode.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout examples/llava/clip.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/09-pooling.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
+ cd ../llama.cpp
+ git apply ../patches/01-load-progress.diff
+ for patch in ../patches/*.diff
+ cd ../llama.cpp
+ git apply ../patches/02-clip-log.diff
+ for patch in ../patches/*.diff
+ cd ../llama.cpp
+ git apply ../patches/03-load_exception.diff
+ for patch in ../patches/*.diff
+ cd ../llama.cpp
+ git apply ../patches/04-metal.diff
+ for patch in ../patches/*.diff
+ cd ../llama.cpp
+ git apply ../patches/05-default-pretokenizer.diff
+ for patch in ../patches/*.diff
+ cd ../llama.cpp
+ git apply ../patches/06-qwen2.diff
+ for patch in ../patches/*.diff
+ cd ../llama.cpp
+ git apply ../patches/07-embeddings.diff
+ for patch in ../patches/*.diff
+ cd ../llama.cpp
+ git apply ../patches/08-clip-unicode.diff
+ for patch in ../patches/*.diff
+ cd ../llama.cpp
+ git apply ../patches/09-pooling.diff
+ COMMON_DARWIN_DEFS='-DBUILD_SHARED_LIBS=off -DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DGGML_METAL_EMBED_LIBRARY=on -DGGML_OPENMP=off'
+ case "${GOARCH}" in
+ '[' -z on ']'
+ '[' -z '' ']'
+ init_vars
+ case "${GOARCH}" in
+ ARCH=x86_64
+ LLAMACPP_DIR=../llama.cpp
+ CMAKE_DEFS=
+ CMAKE_TARGETS='--target ollama_llama_server'
+ echo -I/opt/local/include/libomp
+ grep -- -g
+ CMAKE_DEFS='-DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off '
+ case $(uname -s) in
++ uname -s
+ LIB_EXT=dylib
+ WHOLE_ARCHIVE=-Wl,-force_load
+ NO_WHOLE_ARCHIVE=
+ GCC_ARCH='-arch x86_64'
+ '[' -z '50;52;61;70;75;80' ']'
+ CMAKE_TARGETS='--target llama --target ggml'
+ CMAKE_DEFS='-DBUILD_SHARED_LIBS=off -DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DGGML_METAL_EMBED_LIBRARY=on -DGGML_OPENMP=off -DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off '
+ BUILD_DIR=../build/darwin/x86_64_static
+ echo 'Building static library'
Building static library
+ build
+ cmake -S ../llama.cpp -B ../build/darwin/x86_64_static -DBUILD_SHARED_LIBS=off -DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DGGML_METAL_EMBED_LIBRARY=on -DGGML_OPENMP=off -DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off
-- The C compiler identification is AppleClang 15.0.0.15000309
-- The CXX compiler identification is AppleClang 15.0.0.15000309
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /opt/local/bin/git (found version "2.45.2")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Accelerate framework found
-- Metal framework found
-- The ASM compiler identification is AppleClang
-- Found assembler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/Accelerate.framework
-- BLAS found, Libraries: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/Accelerate.framework
-- BLAS found, Includes: 
-- Using ggml SGEMM
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done (8.1s)
-- Generating done (0.6s)
CMake Warning:
  Manually-specified variables were not used by the project:

    LLAMA_METAL_MACOSX_VERSION_MIN


-- Build files have been written to: /Users/davidlaxer/ollama/llm/build/darwin/x86_64_static
+ cmake --build ../build/darwin/x86_64_static --target llama --target ggml -j8
[ 12%] Generate assembly for embedded Metal library
Embedding Metal library
[ 12%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
[ 12%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[ 37%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 50%] Building CXX object ggml/src/CMakeFiles/ggml.dir/sgemm.cpp.o
[ 50%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[ 62%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-metal.m.o
[ 75%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-blas.cpp.o
[ 75%] Building ASM object ggml/src/CMakeFiles/ggml.dir/__/__/autogenerated/ggml-metal-embed.s.o
/Users/davidlaxer/ollama/llm/llama.cpp/ggml/src/ggml-blas.cpp:160:13: warning: 'cblas_sgemm' is only available on macOS 13.3 or newer [-Wunguarded-availability-new]
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
            ^~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas_new.h:891:6: note: 'cblas_sgemm' has been marked as being introduced in macOS 13.3 here, but the deployment target is macOS 11.3.0
void cblas_sgemm(const enum CBLAS_ORDER ORDER,
     ^
/Users/davidlaxer/ollama/llm/llama.cpp/ggml/src/ggml-blas.cpp:160:13: note: enclose 'cblas_sgemm' in a __builtin_available check to silence this warning
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
            ^~~~~~~~~~~
/Users/davidlaxer/ollama/llm/llama.cpp/ggml/src/ggml-blas.cpp:225:5: warning: 'cblas_sgemm' is only available on macOS 13.3 or newer [-Wunguarded-availability-new]
    cblas_sgemm(CblasRowMajor, transposeA, CblasNoTrans, m, n, k, 1.0, a, lda, b, n, 0.0, c, n);
    ^~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas_new.h:891:6: note: 'cblas_sgemm' has been marked as being introduced in macOS 13.3 here, but the deployment target is macOS 11.3.0
void cblas_sgemm(const enum CBLAS_ORDER ORDER,
     ^
/Users/davidlaxer/ollama/llm/llama.cpp/ggml/src/ggml-blas.cpp:225:5: note: enclose 'cblas_sgemm' in a __builtin_available check to silence this warning
    cblas_sgemm(CblasRowMajor, transposeA, CblasNoTrans, m, n, k, 1.0, a, lda, b, n, 0.0, c, n);
    ^~~~~~~~~~~
2 warnings generated.
[ 75%] Linking CXX static library libggml.a
[ 75%] Built target ggml
[ 75%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[100%] Building CXX object src/CMakeFiles/llama.dir/unicode-data.cpp.o
[100%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[100%] Linking CXX static library libllama.a
[100%] Built target llama
[100%] Built target ggml
+ init_vars
+ case "${GOARCH}" in
+ ARCH=x86_64
+ LLAMACPP_DIR=../llama.cpp
+ CMAKE_DEFS=
+ CMAKE_TARGETS='--target ollama_llama_server'
+ echo -I/opt/local/include/libomp
+ grep -- -g
+ CMAKE_DEFS='-DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off '
+ case $(uname -s) in
++ uname -s
+ LIB_EXT=dylib
+ WHOLE_ARCHIVE=-Wl,-force_load
+ NO_WHOLE_ARCHIVE=
+ GCC_ARCH='-arch x86_64'
+ '[' -z '50;52;61;70;75;80' ']'
+ CMAKE_DEFS='-DBUILD_SHARED_LIBS=off -DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DGGML_METAL_EMBED_LIBRARY=on -DGGML_OPENMP=off -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off '
+ BUILD_DIR=../build/darwin/x86_64/metal
+ EXTRA_LIBS=' -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders'
+ build
+ cmake -S ../llama.cpp -B ../build/darwin/x86_64/metal -DBUILD_SHARED_LIBS=off -DCMAKE_OSX_DEPLOYMENT_TARGET=11.3 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DCMAKE_SYSTEM_NAME=Darwin -DGGML_METAL_EMBED_LIBRARY=on -DGGML_OPENMP=off -DCMAKE_SYSTEM_PROCESSOR=x86_64 -DCMAKE_OSX_ARCHITECTURES=x86_64 -DCMAKE_BUILD_TYPE=Release -DLLAMA_SERVER_VERBOSE=off
-- The C compiler identification is AppleClang 15.0.0.15000309
-- The CXX compiler identification is AppleClang 15.0.0.15000309
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /opt/local/bin/git (found version "2.45.2")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Accelerate framework found
-- Metal framework found
-- The ASM compiler identification is AppleClang
-- Found assembler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/Accelerate.framework
-- BLAS found, Libraries: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/Accelerate.framework
-- BLAS found, Includes: 
-- Using ggml SGEMM
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done (1.9s)
-- Generating done (0.5s)
CMake Warning:
  Manually-specified variables were not used by the project:

    LLAMA_METAL_MACOSX_VERSION_MIN


-- Build files have been written to: /Users/davidlaxer/ollama/llm/build/darwin/x86_64/metal
+ cmake --build ../build/darwin/x86_64/metal --target ollama_llama_server -j8
[  6%] Generate assembly for embedded Metal library
Embedding Metal library
[  6%] Generating build details from Git
-- Found Git: /opt/local/bin/git (found version "2.45.2")
[  6%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[  6%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
[ 20%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 20%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[ 26%] Building ASM object ggml/src/CMakeFiles/ggml.dir/__/__/autogenerated/ggml-metal-embed.s.o
[ 26%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-blas.cpp.o
[ 33%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-metal.m.o
[ 40%] Building CXX object ggml/src/CMakeFiles/ggml.dir/sgemm.cpp.o
[ 46%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 46%] Built target build_info
/Users/davidlaxer/ollama/llm/llama.cpp/ggml/src/ggml-blas.cpp:160:13: warning: 'cblas_sgemm' is only available on macOS 13.3 or newer [-Wunguarded-availability-new]
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
            ^~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas_new.h:891:6: note: 'cblas_sgemm' has been marked as being introduced in macOS 13.3 here, but the deployment target is macOS 11.3.0
void cblas_sgemm(const enum CBLAS_ORDER ORDER,
     ^
/Users/davidlaxer/ollama/llm/llama.cpp/ggml/src/ggml-blas.cpp:160:13: note: enclose 'cblas_sgemm' in a __builtin_available check to silence this warning
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
            ^~~~~~~~~~~
/Users/davidlaxer/ollama/llm/llama.cpp/ggml/src/ggml-blas.cpp:225:5: warning: 'cblas_sgemm' is only available on macOS 13.3 or newer [-Wunguarded-availability-new]
    cblas_sgemm(CblasRowMajor, transposeA, CblasNoTrans, m, n, k, 1.0, a, lda, b, n, 0.0, c, n);
    ^~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas_new.h:891:6: note: 'cblas_sgemm' has been marked as being introduced in macOS 13.3 here, but the deployment target is macOS 11.3.0
void cblas_sgemm(const enum CBLAS_ORDER ORDER,
     ^
/Users/davidlaxer/ollama/llm/llama.cpp/ggml/src/ggml-blas.cpp:225:5: note: enclose 'cblas_sgemm' in a __builtin_available check to silence this warning
    cblas_sgemm(CblasRowMajor, transposeA, CblasNoTrans, m, n, k, 1.0, a, lda, b, n, 0.0, c, n);
    ^~~~~~~~~~~
2 warnings generated.
[ 46%] Linking CXX static library libggml.a
[ 46%] Built target ggml
[ 46%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 60%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 60%] Building CXX object src/CMakeFiles/llama.dir/unicode-data.cpp.o
[ 60%] Linking CXX static library libllama.a
[ 60%] Built target llama
[ 60%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
[ 66%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
[ 66%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 73%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 86%] Building CXX object common/CMakeFiles/common.dir/train.cpp.o
[ 86%] Building CXX object common/CMakeFiles/common.dir/grammar-parser.cpp.o
[ 93%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
[ 93%] Built target llava
[ 93%] Linking CXX static library libcommon.a
[ 93%] Built target common
[100%] Building CXX object ext_server/CMakeFiles/ollama_llama_server.dir/server.cpp.o
/Users/davidlaxer/ollama/llm/ext_server/server.cpp:263:9: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
        sprintf(buffer, "prompt eval time     = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
        ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/stdio.h:180:1: note: 'sprintf' has been explicitly marked deprecated here
__deprecated_msg("This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.")
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/cdefs.h:218:48: note: expanded from macro '__deprecated_msg'
        #define __deprecated_msg(_msg) __attribute__((__deprecated__(_msg)))
                                                      ^
/Users/davidlaxer/ollama/llm/ext_server/server.cpp:277:9: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
        sprintf(buffer, "generation eval time = %10.2f ms / %5d runs   (%8.2f ms per token, %8.2f tokens per second)",
        ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/stdio.h:180:1: note: 'sprintf' has been explicitly marked deprecated here
__deprecated_msg("This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.")
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/cdefs.h:218:48: note: expanded from macro '__deprecated_msg'
        #define __deprecated_msg(_msg) __attribute__((__deprecated__(_msg)))
                                                      ^
/Users/davidlaxer/ollama/llm/ext_server/server.cpp:289:9: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [-Wdeprecated-declarations]
        sprintf(buffer, "          total time = %10.2f ms", t_prompt_processing + t_token_generation);
        ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/stdio.h:180:1: note: 'sprintf' has been explicitly marked deprecated here
__deprecated_msg("This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.")
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/sys/cdefs.h:218:48: note: expanded from macro '__deprecated_msg'
        #define __deprecated_msg(_msg) __attribute__((__deprecated__(_msg)))
                                                      ^
3 warnings generated.
[100%] Linking CXX executable ../bin/ollama_llama_server
ld: warning: ignoring duplicate libraries: '../ggml/src/libggml.a', '../src/libllama.a'
[100%] Built target ollama_llama_server
+ sign ../build/darwin/x86_64/metal/bin/ollama_llama_server
+ '[' -n '' ']'
+ compress
+ echo 'Compressing payloads to reduce overall binary size...'
Compressing payloads to reduce overall binary size...
+ pids=
+ rm -rf '../build/darwin/x86_64/metal/bin/*.gz'
+ for f in ${BUILD_DIR}/bin/*
+ pids+=' 4090'
+ for f in ${BUILD_DIR}/bin/*
+ pids+=' 4091'
+ gzip -n --best -f ../build/darwin/x86_64/metal/bin/ggml-common.h
+ for f in ${BUILD_DIR}/bin/*
+ gzip -n --best -f ../build/darwin/x86_64/metal/bin/ggml-metal.metal
+ pids+=' 4092'
+ '[' -d ../build/darwin/x86_64/metal/lib ']'
+ echo

+ for pid in ${pids}
+ wait 4090
+ gzip -n --best -f ../build/darwin/x86_64/metal/bin/ollama_llama_server
+ for pid in ${pids}
+ wait 4091
+ for pid in ${pids}
+ wait 4092
+ echo 'Finished compression'
Finished compression
+ cleanup
+ cd ../llama.cpp/
+ git checkout CMakeLists.txt
Updated 1 path from the index
++ ls -A ../patches/01-load-progress.diff ../patches/02-clip-log.diff ../patches/03-load_exception.diff ../patches/04-metal.diff ../patches/05-default-pretokenizer.diff ../patches/06-qwen2.diff ../patches/07-embeddings.diff ../patches/08-clip-unicode.diff ../patches/09-pooling.diff
+ '[' -n '../patches/01-load-progress.diff
../patches/02-clip-log.diff
../patches/03-load_exception.diff
../patches/04-metal.diff
../patches/05-default-pretokenizer.diff
../patches/06-qwen2.diff
../patches/07-embeddings.diff
../patches/08-clip-unicode.diff
../patches/09-pooling.diff' ']'
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/01-load-progress.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout common/common.cpp
Updated 1 path from the index
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout common/common.h
Updated 1 path from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/02-clip-log.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout examples/llava/clip.cpp
Updated 1 path from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/03-load_exception.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 1 path from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/04-metal.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout ggml/src/ggml-metal.m
Updated 1 path from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/05-default-pretokenizer.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/06-qwen2.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/07-embeddings.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/08-clip-unicode.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout examples/llava/clip.cpp
Updated 0 paths from the index
+ for patch in ../patches/*.diff
++ grep '^+++ ' ../patches/09-pooling.diff
++ cut -f2 '-d '
++ cut -f2- -d/
+ for file in $(grep "^+++ " ${patch} | cut -f2 -d' ' | cut -f2- -d/)
+ cd ../llama.cpp
+ git checkout src/llama.cpp
Updated 0 paths from the index
++ cd ../build/darwin/x86_64/metal/..
++ echo metal
+ echo 'go generate completed.  LLM runners: metal'
go generate completed.  LLM runners: metal
openai/openai.go
openai/openai_test.go
parser/parser.go
parser/parser_test.go
progress/bar.go
progress/progress.go
progress/spinner.go
readline/buffer.go
readline/errors.go
readline/history.go
readline/readline.go
readline/readline_unix.go
readline/term.go
readline/term_bsd.go
readline/types.go
server/auth.go
server/download.go
server/fixblobs.go
server/fixblobs_test.go
server/images.go
server/layer.go
server/manifest.go
server/manifest_test.go
server/model.go
server/model_test.go
server/modelpath.go
server/modelpath_test.go
server/prompt.go
server/prompt_test.go
server/routes.go
server/routes_create_test.go
server/routes_delete_test.go
server/routes_list_test.go
server/routes_test.go
server/sched.go
server/sched_test.go
server/upload.go
template/template.go
template/template_test.go
types/errtypes/errtypes.go
types/model/name.go
types/model/name_test.go
util/bufioutil/buffer_seeker.go
util/bufioutil/buffer_seeker_test.go
version/version.go
(AI-Feynman) davidlaxer@bluediamond ollama % go build .   
# github.com/ollama/ollama
ld: warning: ignoring duplicate libraries: '-lomp', '-lpthread'
(AI-Feynman) davidlaxer@bluediamond ollama % ollama serve 
2024/07/13 08:26:16 routes.go:940: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-07-13T08:26:16.514-07:00 level=INFO source=images.go:760 msg="total blobs: 33"
time=2024-07-13T08:26:16.518-07:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0"
time=2024-07-13T08:26:16.520-07:00 level=INFO source=routes.go:987 msg="Listening on 127.0.0.1:11434 (version 0.2.3)"
time=2024-07-13T08:26:16.522-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3077127912/runners
time=2024-07-13T08:26:16.571-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-07-13T08:26:16.571-07:00 level=INFO source=types.go:105 msg="inference compute" id="" library=cpu compute="" driver=0.0 name="" total="128.0 GiB" available="56.3 GiB"
time=2024-07-13T08:26:42.227-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[55.3 GiB]" memory.required.full="5.8 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-13T08:26:42.229-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3077127912/runners/cpu_avx2/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 59661"
time=2024-07-13T08:26:42.238-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-13T08:26:42.238-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-13T08:26:42.240-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff844b27fc0" timestamp=1720884402
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff844b27fc0" timestamp=1720884402 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="59661" tid="0x7ff844b27fc0" timestamp=1720884402
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-07-13T08:26:42.743-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  4437.80 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     2.02 MiB
llama_new_context_with_model:        CPU compute buffer size =   560.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
INFO [main] model loaded | tid="0x7ff844b27fc0" timestamp=1720884406
time=2024-07-13T08:26:47.007-07:00 level=INFO source=server.go:617 msg="llama runner started in 4.77 seconds"
[GIN] 2024/07/13 - 08:26:47 | 200 |  5.210192837s |       127.0.0.1 | POST     "/api/embeddings"
time=2024-07-13T08:26:50.756-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[52.3 GiB]" memory.required.full="5.5 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[5.5 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.6 GiB" memory.weights.nonrepeating="105.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
time=2024-07-13T08:26:50.757-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3077127912/runners/cpu_avx2/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 59714"
time=2024-07-13T08:26:50.759-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=2
time=2024-07-13T08:26:50.759-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-13T08:26:50.759-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff844b27fc0" timestamp=1720884410
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff844b27fc0" timestamp=1720884410 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="59714" tid="0x7ff844b27fc0" timestamp=1720884410
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-ff82381e2bea77d91c1b824c7afb83f6fb73e9f7de9dda631bcdbca564aa5435 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Mistral-7B-Instruct-v0.3
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32768
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32768]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32768]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 1027
llm_load_vocab: token to piece cache size = 0.1731 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32768
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.25 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = Mistral-7B-Instruct-v0.3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 781 '<0x0A>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3922.02 MiB
time=2024-07-13T08:26:51.011-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.56 MiB
llama_new_context_with_model:        CPU compute buffer size =   560.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
INFO [main] model loaded | tid="0x7ff844b27fc0" timestamp=1720884416
time=2024-07-13T08:26:56.571-07:00 level=INFO source=server.go:617 msg="llama runner started in 5.81 seconds"
[GIN] 2024/07/13 - 08:26:57 | 200 |  6.627215589s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/07/13 - 08:27:27 | 200 | 30.554324376s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/07/13 - 08:27:29 | 200 |  922.172297ms |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/07/13 - 08:28:01 | 200 | 32.359720208s |       127.0.0.1 | POST     "/api/embeddings"
[GIN] 2024/07/13 - 08:28:03 | 200 |  857.146081ms |       127.0.0.1 | POST     "/api/embeddings"



@ahornby commented on GitHub (Jul 13, 2024):

@dbl001 I spotted two differences; your log indicates:

  • you built with `go build .` whereas in my commit message the command is `CGO_CFLAGS="-I/usr/local/include" CGO_LDFLAGS="-L/usr/local/lib -framework Accelerate" go build .`
  • you ran with `ollama serve` whereas in my commit message the command is `GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve` (the full sequence is sketched just below)
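
Put together, the build-and-run sequence from that commit message looks roughly like this (a sketch only; the /usr/local paths are simply where that commit expects libomp and the headers to live, so adjust them for a MacPorts or Homebrew install like the one in the log above):

```
# build ollama with Accelerate linked in via the CGO flags from the commit message
# (adjust the -I/-L paths to wherever libomp is installed on your machine)
CGO_CFLAGS="-I/usr/local/include" \
CGO_LDFLAGS="-L/usr/local/lib -framework Accelerate" \
go build .

# run the locally built binary and force the metal runner
GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve
```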

@dbl001 commented on GitHub (Jul 13, 2024):

@ahornby My GPU has 16GB... Any suggestions?

ggml_backend_alloc_ctx_tensors_from_buft: tensor output_norm.weight is too large to fit in a Metal buffer (tensor size: 16384, max buffer size: 0)
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model
libc++abi: terminating due to uncaught exception of type std::runtime_error: unable to allocate backend buffer
time=2024-07-13T08:49:52.729-07:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: signal: abort trap error:unable to allocate backend buffer"
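
Worth noting: the "max buffer size: 0" in that error reads as the Metal backend not getting a usable device at all, rather than the 16 GB card running out of memory (the tensor it failed on is only 16 KB). One quick, generic way to check whether macOS itself exposes the Radeon to Metal (a standard macOS command, not something specific to this thread or fork) is:

```
# list GPUs with their VRAM and whether macOS reports Metal support for each
system_profiler SPDisplaysDataType | grep -E "Chipset Model|VRAM|Metal"
```

If the Radeon Pro 5700 XT shows up there with Metal supported, the zero buffer size is more likely about how the runner selects its Metal device on an Intel/AMD machine than about the GPU itself.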

Here's the full output.

 % CGO_CFLAGS="-I/usr/local/include" CGO_LDFLAGS="-L/usr/local/lib -framework Accelerate" go build .
# github.com/ollama/ollama
ld: warning: ignoring duplicate libraries: '-lpthread'
(AI-Feynman) davidlaxer@bluediamond ollama % GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve
2024/07/13 08:49:29 routes.go:940: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:metal OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-07-13T08:49:29.734-07:00 level=INFO source=images.go:760 msg="total blobs: 33"
time=2024-07-13T08:49:29.738-07:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-07-13T08:49:29.739-07:00 level=INFO source=routes.go:987 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-07-13T08:49:29.739-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3507161543/runners
time=2024-07-13T08:49:29.769-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]"
time=2024-07-13T08:49:29.769-07:00 level=INFO source=types.go:105 msg="inference compute" id="" library=cpu compute="" driver=0.0 name="" total="128.0 GiB" available="65.7 GiB"
time=2024-07-13T08:49:51.721-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[65.5 GiB]" memory.required.full="5.8 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-13T08:49:51.721-07:00 level=INFO source=server.go:172 msg="user override" OLLAMA_LLM_LIBRARY=metal path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3507161543/runners/metal
time=2024-07-13T08:49:51.722-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3507161543/runners/metal/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 61564"
time=2024-07-13T08:49:51.725-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-13T08:49:51.725-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-13T08:49:51.725-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff844b27fc0" timestamp=1720885792
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff844b27fc0" timestamp=1720885792 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61564" tid="0x7ff844b27fc0" timestamp=1720885792
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-07-13T08:49:52.228-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
ggml_backend_alloc_ctx_tensors_from_buft: tensor output_norm.weight is too large to fit in a Metal buffer (tensor size: 16384, max buffer size: 0)
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model
libc++abi: terminating due to uncaught exception of type std::runtime_error: unable to allocate backend buffer
time=2024-07-13T08:49:52.729-07:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: signal: abort trap error:unable to allocate backend buffer"
[GIN] 2024/07/13 - 08:49:52 | 500 |  1.030025673s |       127.0.0.1 | POST     "/api/embeddings"


<!-- gh-comment-id:2226968041 --> @dbl001 commented on GitHub (Jul 13, 2024): @ahornby My GPU has 16GB... Any suggestions? ``` ggml_backend_alloc_ctx_tensors_from_buft: tensor output_norm.weight is too large to fit in a Metal buffer (tensor size: 16384, max buffer size: 0) llama_model_load: error loading model: unable to allocate backend buffer llama_load_model_from_file: exception loading model libc++abi: terminating due to uncaught exception of type std::runtime_error: unable to allocate backend buffer time=2024-07-13T08:49:52.729-07:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: signal: abort trap error:unable to allocate backend buffer" ``` Here's the full output. ``` % CGO_CFLAGS="-I/usr/local/include" CGO_LDFLAGS="-L/usr/local/lib -framework Accelerate" go build . # github.com/ollama/ollama ld: warning: ignoring duplicate libraries: '-lpthread' (AI-Feynman) davidlaxer@bluediamond ollama % GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve 2024/07/13 08:49:29 routes.go:940: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:metal OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]" time=2024-07-13T08:49:29.734-07:00 level=INFO source=images.go:760 msg="total blobs: 33" time=2024-07-13T08:49:29.738-07:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached. [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production. 
- using env: export GIN_MODE=release - using code: gin.SetMode(gin.ReleaseMode) [GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers) [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers) [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2024-07-13T08:49:29.739-07:00 level=INFO source=routes.go:987 msg="Listening on 127.0.0.1:11434 (version 0.0.0)" time=2024-07-13T08:49:29.739-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3507161543/runners time=2024-07-13T08:49:29.769-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]" time=2024-07-13T08:49:29.769-07:00 level=INFO source=types.go:105 msg="inference compute" id="" library=cpu compute="" driver=0.0 name="" total="128.0 GiB" available="65.7 GiB" time=2024-07-13T08:49:51.721-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[65.5 GiB]" memory.required.full="5.8 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB" time=2024-07-13T08:49:51.721-07:00 
level=INFO source=server.go:172 msg="user override" OLLAMA_LLM_LIBRARY=metal path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3507161543/runners/metal time=2024-07-13T08:49:51.722-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3507161543/runners/metal/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 61564" time=2024-07-13T08:49:51.725-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1 time=2024-07-13T08:49:51.725-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding" time=2024-07-13T08:49:51.725-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff844b27fc0" timestamp=1720885792 INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff844b27fc0" timestamp=1720885792 total_threads=16 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61564" tid="0x7ff844b27fc0" timestamp=1720885792 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-07-13T08:49:52.228-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.27 MiB ggml_backend_alloc_ctx_tensors_from_buft: tensor output_norm.weight is too large to fit in a Metal buffer (tensor size: 16384, max buffer size: 0) llama_model_load: error loading model: unable to allocate backend buffer llama_load_model_from_file: exception loading model libc++abi: terminating due to uncaught exception of type std::runtime_error: unable to allocate backend buffer time=2024-07-13T08:49:52.729-07:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: signal: abort trap error:unable to allocate backend buffer" [GIN] 2024/07/13 - 08:49:52 | 500 | 1.030025673s | 127.0.0.1 | POST "/api/embeddings" ```
Author
Owner

@ahornby commented on GitHub (Jul 13, 2024):

@dbl001 Try codellama first; that's the only model I tried. If that works, you got as far as I did.

<!-- gh-comment-id:2226970305 --> @ahornby commented on GitHub (Jul 13, 2024): @dbl001 first try with codellama, that's the only model I tried. If that works you got as far as I did
Author
Owner

@dbl001 commented on GitHub (Jul 13, 2024):

@ahornby same thing with codellama

% ollama run codellama
Error: llama runner process has terminated: signal: abort trap error:unable to allocate backend buffer

 % GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve
2024/07/13 09:24:36 routes.go:940: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:metal OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-07-13T09:24:36.795-07:00 level=INFO source=images.go:760 msg="total blobs: 33"
time=2024-07-13T09:24:36.799-07:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-07-13T09:24:36.800-07:00 level=INFO source=routes.go:987 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-07-13T09:24:36.803-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama504657975/runners
time=2024-07-13T09:24:36.831-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]"
time=2024-07-13T09:24:36.831-07:00 level=INFO source=types.go:105 msg="inference compute" id="" library=cpu compute="" driver=0.0 name="" total="128.0 GiB" available="71.1 GiB"
[GIN] 2024/07/13 - 09:24:59 | 200 |      52.888µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/13 - 09:24:59 | 200 |    7.411698ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-13T09:24:59.151-07:00 level=WARN source=types.go:406 msg="invalid option provided" option=""
time=2024-07-13T09:24:59.155-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[71.1 GiB]" memory.required.full="8.3 GiB" memory.required.partial="0 B" memory.required.kv="4.0 GiB" memory.required.allocations="[8.3 GiB]" memory.weights.total="7.4 GiB" memory.weights.repeating="7.3 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="681.0 MiB"
time=2024-07-13T09:24:59.156-07:00 level=INFO source=server.go:172 msg="user override" OLLAMA_LLM_LIBRARY=metal path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama504657975/runners/metal
time=2024-07-13T09:24:59.157-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama504657975/runners/metal/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 64349"
time=2024-07-13T09:24:59.160-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-13T09:24:59.160-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-13T09:24:59.160-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff844b27fc0" timestamp=1720887899
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff844b27fc0" timestamp=1720887899 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="64349" tid="0x7ff844b27fc0" timestamp=1720887899
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1686 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = codellama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: PRE token        = 32007 '▁<PRE>'
llm_load_print_meta: SUF token        = 32008 '▁<SUF>'
llm_load_print_meta: MID token        = 32009 '▁<MID>'
llm_load_print_meta: EOT token        = 32010 '▁<EOT>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.27 MiB
ggml_backend_alloc_ctx_tensors_from_buft: tensor output_norm.weight is too large to fit in a Metal buffer (tensor size: 16384, max buffer size: 0)
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model
libc++abi: terminating due to uncaught exception of type std::runtime_error: unable to allocate backend buffer
time=2024-07-13T09:24:59.662-07:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: signal: abort trap error:unable to allocate backend buffer"
[GIN] 2024/07/13 - 09:24:59 | 500 |  515.156813ms |       127.0.0.1 | POST     "/api/chat"


<!-- gh-comment-id:2226988887 --> @dbl001 commented on GitHub (Jul 13, 2024): @ahornby same thing with codellama ``` % ollama run codellama Error: llama runner process has terminated: signal: abort trap error:unable to allocate backend buffer % GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve 2024/07/13 09:24:36 routes.go:940: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:metal OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]" time=2024-07-13T09:24:36.795-07:00 level=INFO source=images.go:760 msg="total blobs: 33" time=2024-07-13T09:24:36.799-07:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached. [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production. - using env: export GIN_MODE=release - using code: gin.SetMode(gin.ReleaseMode) [GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers) [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers) [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> 
github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2024-07-13T09:24:36.800-07:00 level=INFO source=routes.go:987 msg="Listening on 127.0.0.1:11434 (version 0.0.0)" time=2024-07-13T09:24:36.803-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama504657975/runners time=2024-07-13T09:24:36.831-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]" time=2024-07-13T09:24:36.831-07:00 level=INFO source=types.go:105 msg="inference compute" id="" library=cpu compute="" driver=0.0 name="" total="128.0 GiB" available="71.1 GiB" [GIN] 2024/07/13 - 09:24:59 | 200 | 52.888µs | 127.0.0.1 | HEAD "/" [GIN] 2024/07/13 - 09:24:59 | 200 | 7.411698ms | 127.0.0.1 | POST "/api/show" time=2024-07-13T09:24:59.151-07:00 level=WARN source=types.go:406 msg="invalid option provided" option="" time=2024-07-13T09:24:59.155-07:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[71.1 GiB]" memory.required.full="8.3 GiB" memory.required.partial="0 B" memory.required.kv="4.0 GiB" memory.required.allocations="[8.3 GiB]" memory.weights.total="7.4 GiB" memory.weights.repeating="7.3 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="681.0 MiB" time=2024-07-13T09:24:59.156-07:00 level=INFO source=server.go:172 msg="user override" OLLAMA_LLM_LIBRARY=metal path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama504657975/runners/metal time=2024-07-13T09:24:59.157-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama504657975/runners/metal/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 64349" time=2024-07-13T09:24:59.160-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1 time=2024-07-13T09:24:59.160-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding" time=2024-07-13T09:24:59.160-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff844b27fc0" timestamp=1720887899 INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff844b27fc0" timestamp=1720887899 total_threads=16 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="64349" tid="0x7ff844b27fc0" timestamp=1720887899 llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = codellama llama_model_loader: - kv 2: llama.context_length u32 = 16384 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 259 llm_load_vocab: token to piece cache size = 0.1686 MB llm_load_print_meta: format = GGUF V2 llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32016 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 16384 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 4096 llm_load_print_meta: n_embd_v_gqa = 4096 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 11008 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 16384 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 6.74 B llm_load_print_meta: model size = 3.56 GiB (4.54 BPW) llm_load_print_meta: general.name = codellama llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_print_meta: PRE token = 32007 
'▁<PRE>' llm_load_print_meta: SUF token = 32008 '▁<SUF>' llm_load_print_meta: MID token = 32009 '▁<MID>' llm_load_print_meta: EOT token = 32010 '▁<EOT>' llm_load_print_meta: max token length = 48 llm_load_tensors: ggml ctx size = 0.27 MiB ggml_backend_alloc_ctx_tensors_from_buft: tensor output_norm.weight is too large to fit in a Metal buffer (tensor size: 16384, max buffer size: 0) llama_model_load: error loading model: unable to allocate backend buffer llama_load_model_from_file: exception loading model libc++abi: terminating due to uncaught exception of type std::runtime_error: unable to allocate backend buffer time=2024-07-13T09:24:59.662-07:00 level=ERROR source=sched.go:443 msg="error loading llama server" error="llama runner process has terminated: signal: abort trap error:unable to allocate backend buffer" [GIN] 2024/07/13 - 09:24:59 | 500 | 515.156813ms | 127.0.0.1 | POST "/api/chat" ```
Author
Owner

@ahornby commented on GitHub (Jul 13, 2024):

In that case I don't know; I'm guessing it's due to the different GPU. Hopefully you can work out the problem on your card now that you have a way to build and test.

<!-- gh-comment-id:2226991008 --> @ahornby commented on GitHub (Jul 13, 2024): in that case I don't know, guessing its due to different GPU. Hopefully you can work out the problem on your card now that you have a way to build and test
Author
Owner

@dbl001 commented on GitHub (Jul 13, 2024):

@ahornby Thank you!

<!-- gh-comment-id:2226993332 --> @dbl001 commented on GitHub (Jul 13, 2024): @ahornby Thank You!
Author
Owner

@Grergo commented on GitHub (Jul 14, 2024):

Why don't we give Vulkan a try? I've successfully run the Vulkan backend of llama.cpp on an Intel Mac, utilizing the AMD GPU. The performance is slightly better than using just the CPU.
Intel 12700 AVX2: (benchmark screenshot attached)
AMD 6600XT Vulkan: (benchmark screenshot attached)
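
For anyone who wants to reproduce the standalone llama.cpp Vulkan run, here is a rough, hedged sketch (not ollama-specific; the CMake option and binary names depend on the llama.cpp revision, and macOS needs the LunarG Vulkan SDK / MoltenVK installed to provide a Vulkan loader):

```
# Assumes the LunarG Vulkan SDK (MoltenVK) is already installed.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=1        # older checkouts used -DLLAMA_VULKAN=1
cmake --build build --config Release -j
# -ngl offloads layers to the GPU; the binary was called 'main' in older builds.
./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello"
```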

<!-- gh-comment-id:2227297280 --> @Grergo commented on GitHub (Jul 14, 2024): Why don't we give Vulkan a try? I've successfully run the Vulkan backend of llama.cpp on an Intel Mac, utilizing the AMD GPU. The performance is slightly better than using just the CPU. Intel 12700 AVX2: <img width="1432" alt="CPU_AVX2" src="https://github.com/user-attachments/assets/384786bb-5aa4-40f6-a9b9-2afd5ed333d9"> AMD 6600XT Vulkan: <img width="1432" alt="AMD" src="https://github.com/user-attachments/assets/7c599a6b-aa21-4aee-ac9b-1f8d08c5b4f1">
Author
Owner

@ahornby commented on GitHub (Jul 14, 2024):

@Grergo I was able to build ollama with Vulkan, but the output is garbled on my 560X. Feel free to pick up where I left off from commit 5709e59e10 (branch: https://github.com/ahornby/ollama/tree/macos_amd64_gpu).

<!-- gh-comment-id:2227362270 --> @ahornby commented on GitHub (Jul 14, 2024): @Grergo was able to build ollama with vulkan but output is garbled on my 560X. Feel free to try where I left off from commit https://github.com/ollama/ollama/commit/5709e59e10808b3621c35910bd5df948ed6a740e (branch is https://github.com/ahornby/ollama/tree/macos_amd64_gpu)
Author
Owner

@dbl001 commented on GitHub (Jul 14, 2024):

@ahornby In my previous attempt from the master branch, ollama was able to detect the GPU:

ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon Pro 5700 XT
ggml_metal_init: picking default device: AMD Radeon Pro 5700 XT
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/davidlaxer/ollama/llm/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   AMD Radeon Pro 5700 XT
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  = 17163.09 MB

With your branch (https://github.com/ahornby/ollama/tree/macos_amd64_gpu), ggml_metal_init didn't get as far:

ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon Pro 5700 XT
ggml_metal_init: picking default device: (null)
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   (null)
ggml_metal_init: simdgroup reduction support   = false
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  =     0.00 MB

Here's what's changed.

info := GpuInfo{
	Library: "metal",
	ID:      "0",
}

https://github.com/ollama/ollama/compare/main...ahornby:macos_amd64_gpu

(Screenshot attached: 2024-07-14 8:39 AM)
<!-- gh-comment-id:2227390630 --> @dbl001 commented on GitHub (Jul 14, 2024): @ahornby In my previous attempt from the master branch ollama was able to detect the GPU: ``` ggml_metal_init: allocating ggml_metal_init: found device: AMD Radeon Pro 5700 XT ggml_metal_init: picking default device: AMD Radeon Pro 5700 XT ggml_metal_init: default.metallib not found, loading from source ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil ggml_metal_init: loading '/Users/davidlaxer/ollama/llm/llama.cpp/ggml-metal.metal' ggml_metal_init: GPU name: AMD Radeon Pro 5700 XT ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. support = false ggml_metal_init: hasUnifiedMemory = false ggml_metal_init: recommendedMaxWorkingSetSize = 17163.09 MB ``` With your branch (e.g. - branch is https://github.com/ahornby/ollama/tree/macos_amd64_gpu) I'm ggml_metal_init didn't get as far: ``` ggml_metal_init: allocating ggml_metal_init: found device: AMD Radeon Pro 5700 XT ggml_metal_init: picking default device: (null) ggml_metal_init: using embedded metal library ggml_metal_init: GPU name: (null) ggml_metal_init: simdgroup reduction support = false ggml_metal_init: simdgroup matrix mul. support = false ggml_metal_init: hasUnifiedMemory = false ggml_metal_init: recommendedMaxWorkingSetSize = 0.00 MB ``` Here's what's changed. ``` info := GpuInfo{ Library: "metal", ID: "0", } ``` https://github.com/ollama/ollama/compare/main...ahornby:macos_amd64_gpu <img width="2560" alt="Screenshot 2024-07-14 at 8 39 49 AM" src="https://github.com/user-attachments/assets/98b96fe7-38c8-45e0-910e-356e12eab1b7">
Author
Owner

@ahornby commented on GitHub (Jul 14, 2024):

@dbl001 Try the Vulkan mode; that's what's new.

<!-- gh-comment-id:2227392934 --> @ahornby commented on GitHub (Jul 14, 2024): @dbl001 try the vulkan mode, that's what's new
Author
Owner

@Grergo commented on GitHub (Jul 15, 2024):

@ahornby I built Ollama from commit 5709e9 and it outputs content smoothly without any garbled output.
(Three screenshots attached)

<!-- gh-comment-id:2228479965 --> @Grergo commented on GitHub (Jul 15, 2024): @ahornby I built Ollama from commit 5709e9 and it outputs content smoothly without any garbled output. <img width="1300" alt="screen1" src="https://github.com/user-attachments/assets/6231d8e9-2a6b-4909-b730-2dd4f8d015a1"> <img width="1300" alt="screen2" src="https://github.com/user-attachments/assets/ff2cfa21-f72e-4660-8a85-7fa13b719297"> <img width="1300" alt="screen3" src="https://github.com/user-attachments/assets/948c51f9-2dbb-4ee2-bd17-0637b895d826">
Author
Owner

@ahornby commented on GitHub (Jul 15, 2024):

@Grergo Nice! Glad it was useful to someone. I guess my 560X is just too old.

<!-- gh-comment-id:2228483298 --> @ahornby commented on GitHub (Jul 15, 2024): @Grergo nice! Glad it was useful to someone. I guess my 560X is just too old
Author
Owner

@Grergo commented on GitHub (Jul 15, 2024):

@ahornby Thank you for your work. I suspect the issue might be caused by the R560x not supporting Metal 3, but this is just my guess. Currently, there are still performance issues with Vulkan, and the GPU utilization is not high.
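
One way to check that guess without rebuilding anything: macOS reports the Metal support level per GPU through system_profiler (the exact label, e.g. "Metal Support" or "Metal Family", varies by macOS version):

```
# Lists each GPU along with its reported Metal support level.
system_profiler SPDisplaysDataType | grep -i "metal"
```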

<!-- gh-comment-id:2228505033 --> @Grergo commented on GitHub (Jul 15, 2024): @ahornby Thank you for your work. I suspect the issue might be caused by the R560x not supporting Metal 3, but this is just my guess. Currently, there are still performance issues with Vulkan, and the GPU utilization is not high.
Author
Owner

@dbl001 commented on GitHub (Jul 16, 2024):

@ahornby Why do I get:

WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="0x7ff844b27fc0" timestamp=1721142787

I tried setting LLAMA_SUPPORTS_GPU_OFFLOAD=on, but it doesn't help.

 % GIN_MODE=debug OLLAMA_LLM_LIBRARY=vulkan ./ollama serve
2024/07/16 08:13:01 routes.go:958: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:vulkan OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-07-16T08:13:01.280-07:00 level=INFO source=images.go:760 msg="total blobs: 38"
time=2024-07-16T08:13:01.284-07:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-07-16T08:13:01.285-07:00 level=INFO source=routes.go:1005 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-07-16T08:13:01.288-07:00 level=WARN source=assets.go:100 msg="unable to cleanup stale tmpdir" path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4050472326 error="remove /var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4050472326: directory not empty"
time=2024-07-16T08:13:01.289-07:00 level=WARN source=assets.go:100 msg="unable to cleanup stale tmpdir" path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4115672533 error="remove /var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4115672533: directory not empty"
time=2024-07-16T08:13:01.289-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3491634299/runners
time=2024-07-16T08:13:01.328-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal cpu cpu_avx cpu_avx2]"
time=2024-07-16T08:13:01.375-07:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=vulkan compute="" driver=0.0 name="" total="16.0 GiB" available="16.0 GiB"
[GIN] 2024/07/16 - 08:13:07 | 200 |      78.551µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/16 - 08:13:07 | 200 |    8.798674ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-16T08:13:07.555-07:00 level=WARN source=types.go:408 msg="invalid option provided" option=""
time=2024-07-16T08:13:07.560-07:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/davidlaxer/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac gpu=0 parallel=4 available=17163091968 required="8.8 GiB"
time=2024-07-16T08:13:07.560-07:00 level=INFO source=memory.go:309 msg="offload to vulkan" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[16.0 GiB]" memory.required.full="8.8 GiB" memory.required.partial="8.8 GiB" memory.required.kv="4.0 GiB" memory.required.allocations="[8.8 GiB]" memory.weights.total="7.4 GiB" memory.weights.repeating="7.3 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="681.0 MiB"
time=2024-07-16T08:13:07.560-07:00 level=INFO source=server.go:170 msg="Invalid OLLAMA_LLM_LIBRARY vulkan - not found"
time=2024-07-16T08:13:07.561-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama3491634299/runners/cpu_avx2/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 64970"
time=2024-07-16T08:13:07.565-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-16T08:13:07.565-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-16T08:13:07.566-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="0x7ff844b27fc0" timestamp=1721142787
INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff844b27fc0" timestamp=1721142787
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff844b27fc0" timestamp=1721142787 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="64970" tid="0x7ff844b27fc0" timestamp=1721142787
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1686 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = codellama
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: PRE token        = 32007 '▁<PRE>'
llm_load_print_meta: SUF token        = 32008 '▁<SUF>'
llm_load_print_meta: MID token        = 32009 '▁<MID>'
llm_load_print_meta: EOT token        = 32010 '▁<EOT>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3647.95 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
time=2024-07-16T08:13:08.068-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llama_kv_cache_init:        CPU KV buffer size =  4096.00 MiB
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.55 MiB
llama_new_context_with_model:        CPU compute buffer size =   560.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
INFO [main] model loaded | tid="0x7ff844b27fc0" timestamp=1721142796
time=2024-07-16T08:13:16.541-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server not responding"
time=2024-07-16T08:13:16.793-07:00 level=INFO source=server.go:617 msg="llama runner started in 9.23 seconds"
[GIN] 2024/07/16 - 08:13:16 | 200 |    9.2416228s |       127.0.0.1 | POST     "/api/chat"


Author
Owner

@ahornby commented on GitHub (Jul 16, 2024):

@dbl001 your log indicates you haven't built the Vulkan support. The indications are:

  • vulkan is not in the "Dynamic LLM libraries" list in your log, and
  • you also get the message "Invalid OLLAMA_LLM_LIBRARY vulkan - not found"

Perhaps you had a build error, or didn't run the right build command.

Please read the commit message for the commands to install the Vulkan libs and build (a rough sketch follows below). If the build doesn't produce a Vulkan runner binary, there is no way to select it at the next step.
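
A rough sketch of the kind of workflow, assuming Homebrew; the package list below is only illustrative, and the authoritative steps (including any MacPorts equivalents) are in the branch's commit message:

```
# Vulkan/MoltenVK prerequisites (exact package set is an assumption; check the commit message)
brew install molten-vk vulkan-headers vulkan-loader glslang

# build ollama from source; the branch's generate step should also emit a vulkan runner
go generate ./...
go build .

# sanity check: "vulkan" must appear in the "Dynamic LLM libraries [...]" line at startup
OLLAMA_LLM_LIBRARY=vulkan ./ollama serve
```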

Author
Owner

@dbl001 commented on GitHub (Jul 17, 2024):

I had to make some changes in llm/generate/gen_darwin.sh to switch from Homebrew to MacPorts. The server log looks reasonable; however, the output is either garbled, null, or (in the case of embeddings) all zeros.

(AI-Feynman) davidlaxer@BlueDiamond-2 ollama % ollama run llama3        
>>> Hello World!
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

>>> 1+1=?
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

>>> 

Server output

(AI-Feynman) davidlaxer@BlueDiamond-2 ollama % GIN_MODE=debug OLLAMA_LLM_LIBRARY=vulkan ./ollama serve
2024/07/16 19:10:17 routes.go:958: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:vulkan OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-07-16T19:10:17.641-07:00 level=INFO source=images.go:760 msg="total blobs: 38"
time=2024-07-16T19:10:17.646-07:00 level=INFO source=images.go:767 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

  • using env: export GIN_MODE=release
  • using code: gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-07-16T19:10:17.647-07:00 level=INFO source=routes.go:1005 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-07-16T19:10:17.649-07:00 level=WARN source=assets.go:100 msg="unable to cleanup stale tmpdir" path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4050472326 error="remove /var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4050472326: directory not empty"
time=2024-07-16T19:10:17.649-07:00 level=WARN source=assets.go:100 msg="unable to cleanup stale tmpdir" path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4115672533 error="remove /var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama4115672533: directory not empty"
time=2024-07-16T19:10:17.650-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1471678898/runners
time=2024-07-16T19:10:17.684-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal vulkan]"
time=2024-07-16T19:10:17.708-07:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=vulkan compute="" driver=0.0 name="" total="16.0 GiB" available="16.0 GiB"
[GIN] 2024/07/16 - 19:16:26 | 200 | 61.932µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/07/16 - 19:16:26 | 200 | 20.510101ms | 127.0.0.1 | POST "/api/show"
time=2024-07-16T19:16:26.264-07:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=0 parallel=4 available=17163091968 required="6.3 GiB"
time=2024-07-16T19:16:26.265-07:00 level=INFO source=memory.go:309 msg="offload to vulkan" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[16.0 GiB]" memory.required.full="6.3 GiB" memory.required.partial="6.3 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.3 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-16T19:16:26.265-07:00 level=INFO source=server.go:172 msg="user override" OLLAMA_LLM_LIBRARY=vulkan path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1471678898/runners/vulkan
time=2024-07-16T19:16:26.266-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1471678898/runners/vulkan/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 52304"
time=2024-07-16T19:16:26.269-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-16T19:16:26.269-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-16T19:16:26.269-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff844b27fc0" timestamp=1721182586
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff844b27fc0" timestamp=1721182586 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="52304" tid="0x7ff844b27fc0" timestamp=1721182586
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-07-16T19:16:26.771-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon Pro 5700 XT (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 281.81 MiB
llm_load_tensors: AMD Radeon Pro 5700 XT buffer size = 4155.99 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon Pro 5700 XT KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 2.02 MiB
llama_new_context_with_model: AMD Radeon Pro 5700 XT compute buffer size = 560.00 MiB
llama_new_context_with_model: CPU compute buffer size = 0.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded | tid="0x7ff844b27fc0" timestamp=1721182602
time=2024-07-16T19:16:42.066-07:00 level=INFO source=server.go:617 msg="llama runner started in 15.80 seconds"
[GIN] 2024/07/16 - 19:16:42 | 200 | 15.831344335s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/07/16 - 19:16:51 | 200 | 2.97464185s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/07/16 - 19:17:05 | 200 | 2.97567865s | 127.0.0.1 | POST "/api/chat"
time=2024-07-16T19:22:10.921-07:00 level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.00131764 model=/Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-16T19:22:11.172-07:00 level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251858415 model=/Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-16T19:22:11.422-07:00 level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501923465 model=/Users/davidlaxer/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa

Author
Owner

@Gantaronee commented on GitHub (Aug 2, 2024):

Error: llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Metal KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
ggml_gallocr_reserve_n: failed to allocate Metal buffer of size 8891928576
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/Users/xxxxxx/Downloads/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf'
main: error: unable to load model

Solution (no error):
./llama-cli -m /Users/xxxx/Downloads/Meta-Llama-3.1-8B-Instruct-IQ4_XS.gguf -n 32 --n-gpu-layers 35 --ctx_size 2048 --batch-size 512

llama_kv_cache_init: Metal KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: Metal compute buffer size = 258.50 MiB
llama_new_context_with_model: CPU compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 512, n_predict = 32, n_keep = 1

def dance_party(dance_moves):
"""
This function takes a list of dance moves and returns a string representing a dance party.

llama_print_timings: load time = 6747.93 ms
llama_print_timings: sample time = 7.48 ms / 32 runs ( 0.23 ms per token, 4280.94 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 0 tokens ( nan ms per token, nan tokens per second)
llama_print_timings: eval time = 29722.94 ms / 32 runs ( 928.84 ms per token, 1.08 tokens per second)
llama_print_timings: total time = 29744.97 ms / 32 tokens
Log end

But performance is poor; it seems the GPU is not actually being used during generation. I have an AMD Radeon Pro 5600M 8GB.
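
For anyone hitting the same out-of-memory failure through Ollama rather than llama-cli, the equivalent knob is the request-level `num_ctx` option (and optionally `num_batch`). A minimal sketch in Swift, assuming a local Ollama server on the default port 11434 and a hypothetical `llama3.1` model tag:

```swift
import Foundation

// Cap the context so the Metal KV cache and compute buffers fit in 8 GB of
// VRAM, mirroring the --ctx_size 2048 / --batch-size 512 run above.
// The model tag and prompt are placeholders.
let payload: [String: Any] = [
    "model": "llama3.1",
    "prompt": "Write a haiku about eGPUs.",
    "stream": false,
    "options": ["num_ctx": 2048, "num_batch": 512],
]

var request = URLRequest(url: URL(string: "http://127.0.0.1:11434/api/generate")!)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.httpBody = try! JSONSerialization.data(withJSONObject: payload)

let done = DispatchSemaphore(value: 0)
URLSession.shared.dataTask(with: request) { data, _, error in
    defer { done.signal() }
    if let data = data, let body = String(data: data, encoding: .utf8) {
        print(body)  // JSON response containing the generated text
    } else {
        print("request failed: \(error?.localizedDescription ?? "unknown error")")
    }
}.resume()
done.wait()
```

The same limit can also be baked into a Modelfile with `PARAMETER num_ctx 2048`, so every load of that model uses the smaller KV cache.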

Author
Owner

@Dirrelito071 commented on GitHub (Aug 3, 2024):

Hi, I'm running with an eGPU and it seems like it's making a bad choice by picking my internal GPU. Is there a way to force it to use the Vega 64 instead, via a flag or something? Or to change my default device?

ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon RX Vega 64
ggml_metal_init: found device: Intel(R) UHD Graphics 630
ggml_metal_init: picking default device: Intel(R) UHD Graphics 630
ggml_metal_init: using embedded metal library

Author
Owner

@Dirrelito071 commented on GitHub (Aug 4, 2024):

> Hi, I'm running with an eGPU and it seems like it's making a bad choice by picking my internal GPU. Is there a way to force it to use the Vega 64 instead, via a flag or something? Or to change my default device?
>
> ggml_metal_init: allocating ggml_metal_init: found device: AMD Radeon RX Vega 64 ggml_metal_init: found device: Intel(R) UHD Graphics 630 ggml_metal_init: picking default device: Intel(R) UHD Graphics 630 ggml_metal_init: using embedded metal library

Solved this problem myself, just by connecting my eGPU to a monitor. This made my Mac declare the Vega 64 as the default GPU, and Ollama picked it up as well.

Now my problem is that the output is garbled. I don't know of any solution to that; if someone has a possible fix I'm happy to try it.
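
For anyone who wants to check which device macOS will hand to ggml_metal_init before resorting to the monitor trick, here is a small sketch in Swift (public Metal APIs only) that lists every device and marks the one MTLCreateSystemDefaultDevice() returns:

```swift
import Metal

// Print every Metal device with a few traits, and flag the one the system
// considers the default. On an Intel Mac with an eGPU, the default is what
// ggml_metal_init ends up using unless the code is patched to pick another.
let defaultDevice = MTLCreateSystemDefaultDevice()
for device in MTLCopyAllDevices() {
    var traits: [String] = []
    traits.append(device.isLowPower ? "low-power (integrated)" : "discrete")
    traits.append(device.isRemovable ? "removable (eGPU)" : "built-in")
    traits.append(device.isHeadless ? "headless" : "driving a display")
    let marker = device.registryID == defaultDevice?.registryID ? "  <-- default" : ""
    print("\(device.name): \(traits.joined(separator: ", "))\(marker)")
}
```

On a headless eGPU setup the default is often the built-in GPU, which matches the log above; driving a display from the eGPU changes which device the system reports as the default.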

Author
Owner

@mfoxworthy commented on GitHub (Aug 4, 2024):

I have it "working" on a Macbook Pro i9 AMD Radeon Pro 5500M but it too is slower than the CPU. It used 99% of the CPU with the codellama LLM. Unless I am missing something, I'm just going to give up and just use my M1 for this. That performs extremely well even with 27b LLMs.

INFO [main] build info | build=3337 commit="a8db2a9c" tid="0x7ff84acf1fc0" timestamp=1722784374
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="0x7ff84acf1fc0" timestamp=1722784374 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="60974" tid="0x7ff84acf1fc0" timestamp=1722784374
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/mfoxworthy/.ollama/models/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = codellama
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 2
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32016] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32016] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32016] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1686 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = codellama
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: PRE token = 32007 '▁<PRE>'
llm_load_print_meta: SUF token = 32008 '▁<SUF>'
llm_load_print_meta: MID token = 32009 '▁<MID>'
llm_load_print_meta: EOT token = 32010 '▁<EOT>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.27 MiB
time=2024-08-04T08:12:54.489-07:00 level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250269658 model=/Users/mfoxworthy/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
time=2024-08-04T08:12:54.502-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
time=2024-08-04T08:12:54.745-07:00 level=WARN source=sched.go:634 msg="gpu VRAM usage didn't recover within timeout" seconds=5.506195893 model=/Users/mfoxworthy/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
ggml_backend_metal_log_allocated_size: allocated buffer, size = 3577.61 MiB, ( 3577.61 / 8176.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.35 MiB
llm_load_tensors: Metal buffer size = 3577.61 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon Pro 5500M
ggml_metal_init: found device: Intel(R) UHD Graphics 630
ggml_metal_init: picking default device: AMD Radeon Pro 5500M
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: AMD Radeon Pro 5500M
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory = false
ggml_metal_init: recommendedMaxWorkingSetSize = 8573.16 MB
ggml_metal_init: skipping kernel_mul_mm_f32_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h128 (not supported)
llama_kv_cache_init: Metal KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.14 MiB
llama_new_context_with_model: Metal compute buffer size = 164.00 MiB
llama_new_context_with_model: CPU compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded | tid="0x7ff84acf1fc0" timestamp=1722784384
time=2024-08-04T08:13:04.336-07:00 level=INFO source=server.go:617 msg="llama runner started in 10.09 seconds"
[GIN] 2024/08/04 - 08:15:06 | 200 | 2m16s | 127.0.0.1 | POST "/api/chat"
time=2024-08-04T08:16:24.718-07:00 level=WARN source=types.go:406 msg="invalid option provided" option=""
[GIN] 2024/08/04 - 08:17:53 | 200 | 1m28s | 127.0.0.1 | POST "/api/chat"
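
The slower-than-CPU behaviour lines up with the `simdgroup matrix mul. support = false` line and the long list of skipped `kernel_mul_mm_*` kernels above: without those, matrix multiplies fall back to much slower Metal paths. As far as I can tell, ggml gates the simdgroup-matrix kernels on an Apple GPU family check that Intel-Mac AMD GPUs don't pass. A quick Swift sketch to see which family flags a device reports:

```swift
import Metal

// Print the GPU family support flags Metal exposes for the default device.
// AMD GPUs on Intel Macs typically report the Common/Mac families but not
// the Apple families, so the fast simdgroup-matrix kernels get skipped.
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("no Metal device available")
}
print("device:", device.name)
print("MTLGPUFamily.apple7 (Apple Silicon, simdgroup matmul):", device.supportsFamily(.apple7))
print("MTLGPUFamily.mac2:", device.supportsFamily(.mac2))
print("MTLGPUFamily.common3:", device.supportsFamily(.common3))
print("unified memory:", device.hasUnifiedMemory)
```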

Author
Owner

@Dirrelito071 commented on GitHub (Aug 4, 2024):

Seems like many have now got their graphics card identified. Making actual use of it is a whole other matter, though. I hope someone smart and knowledgeable can try out some solutions to get more of the GPU's power used, and to get rid of the garbled outputs.

I'm mostly saying this, since a working use case could drive this issue / matter forward to a more complete solution for everyone.

Author
Owner

@akaraon8bit commented on GitHub (Aug 5, 2024):

Hello all, I am running macOS Sonoma. sw_vers output:
ProductName: macOS
ProductVersion: 14.6
BuildVersion: 23G80

  Chipset Model: AMD Radeon Pro 5600M
  Type: GPU
  Bus: PCIe
  PCIe Lane Width: x16
  VRAM (Total): 8 GB
  Vendor: AMD (0x1002)
  Device ID: 0x7360
  Revision ID: 0x0041
  ROM Revision: 113-D3000E-192
  VBIOS Version: 113-D3000A0U-015
  Option ROM Version: 113-D3000A0U-015
  EFI Driver Version: 01.A1.192
  Automatic Graphics Switching: Supported
  gMux Version: 5.0.0

I followed different build recipes from this thread, and get:

Ollama call failed with status code 500: llama runner process has terminated: signal: segmentation fault

build with OLLAMA_SKIP_CPU_GENERATE=on OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_F16C=on -DLLAMA_FMA=on -DLLAMA_METAL=on -DLLAMA_METAL_EMBED_LIBRARY=on -DGGML_USE_METAL=on -DLLAMA_METAL_COMPILE_SERIALIZED=1" -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DLLAMA_SUPPORTS_GPU_OFFLOAD=on go generate -v ./...

% GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve
2024/08/05 16:04:08 routes.go:1108: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11436 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:metal OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/beanscake/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-08-05T16:04:08.844-04:00 level=INFO source=images.go:781 msg="total blobs: 5"
time=2024-08-05T16:04:08.846-04:00 level=INFO source=images.go:788 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-05T16:04:08.848-04:00 level=INFO source=routes.go:1155 msg="Listening on 127.0.0.1:11436 (version 0.0.0)"
time=2024-08-05T16:04:08.850-04:00 level=WARN source=assets.go:94 msg="found running ollama" pid=21909 path=/var/folders/96/n7_tkx8d6ks_xk1w4mzzjsnm0000gr/T/ollama2246786142
time=2024-08-05T16:04:08.851-04:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/96/n7_tkx8d6ks_xk1w4mzzjsnm0000gr/T/ollama4191842182/runners
time=2024-08-05T16:04:09.136-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 metal]"
time=2024-08-05T16:04:09.136-04:00 level=INFO source=types.go:105 msg="inference compute" id="" library=cpu compute="" driver=0.0 name="" total="64.0 GiB" available="44.3 GiB"



got:

time=2024-08-05T16:27:31.591-04:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[43.2 GiB]" memory.required.full="5.8 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-08-05T16:27:31.592-04:00 level=INFO source=server.go:172 msg="user override" OLLAMA_LLM_LIBRARY=metal path=/var/folders/96/n7_tkx8d6ks_xk1w4mzzjsnm0000gr/T/ollama4191842182/runners/metal
time=2024-08-05T16:27:31.596-04:00 level=INFO source=server.go:384 msg="starting llama server" cmd="/var/folders/96/n7_tkx8d6ks_xk1w4mzzjsnm0000gr/T/ollama4191842182/runners/metal/ollama_llama_server --model /Users/beanscake/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --mlock --parallel 4 --port 53239"
time=2024-08-05T16:27:31.609-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-05T16:27:31.609-04:00 level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
time=2024-08-05T16:27:31.611-04:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3485 commit="6eeaeba1" tid="0x7ff85a6abdc0" timestamp=1722889651
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x7ff85a6abdc0" timestamp=1722889651 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="53239" tid="0x7ff85a6abdc0" timestamp=1722889651
time=2024-08-05T16:27:31.867-04:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /Users/beanscake/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
time=2024-08-05T16:27:35.142-04:00 level=ERROR source=sched.go:451 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault"
[GIN] 2024/08/05 - 16:27:35 | 500 |  3.710459805s |       127.0.0.1 | POST     "/api/chat"
Author
Owner

@dbl001 commented on GitHub (Aug 12, 2024):

I got ollama to run llama 3.1 on my iMac 27" utilizing an AMD Radeon Pro 5700 XT.

[Screenshot 2024-08-12 at 9 18 43 AM]

I had to modify ggml-metal.m to work around a problem where `id<MTLDevice> device = MTLCreateSystemDefaultDevice();` always returned nil. I still don't know why.
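
For anyone who wants to poke at the device-selection problem outside of ollama, here is a minimal standalone Objective-C sketch (the file name `metal_devices.m` and the clang invocation are just illustrative, not part of the ollama build) that prints what `MTLCreateSystemDefaultDevice()` returns next to everything `MTLCopyAllDevices()` reports. On this machine it should reproduce what the patched ollama sees: the default device comes back nil while the Radeon is still listed by `MTLCopyAllDevices()`.

```
// metal_devices.m — hypothetical standalone diagnostic, not part of the ollama tree.
// Build (assumed): clang -fobjc-arc -framework Foundation -framework Metal metal_devices.m -o metal_devices
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#include <stdio.h>

int main(void) {
    @autoreleasepool {
        // What ggml-metal.m asks for by default.
        id<MTLDevice> def = MTLCreateSystemDefaultDevice();
        fprintf(stderr, "MTLCreateSystemDefaultDevice: %s\n",
                def != nil ? def.name.UTF8String : "(nil)");

        // Everything Metal can enumerate on this machine (integrated, discrete, eGPU).
        NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices();
        for (id<MTLDevice> dev in devices) {
            fprintf(stderr, "found: %s  lowPower=%d  headless=%d  maxBufferLength=%lu\n",
                    dev.name.UTF8String,
                    [dev isLowPower], [dev isHeadless],
                    (unsigned long)dev.maxBufferLength);
        }
    }
    return 0;
}
```

The `isLowPower`/`isHeadless` checks are the same ones the patch below uses to skip the integrated GPU and pick the Radeon.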

Running llama 3.1:

```
A clever play on words!

In the sense that they are apex predators in their native range and can be 
quite powerful, yes, moose do "rule" in their own domain. Here are a few 
reasons why:

1. **King of the Forest**: Moose (Alces alces) are the largest members of 
the deer family (Cervidae) in North America, with males weighing up to 
1,500 pounds (680 kg). They're well-adapted to their environment and play 
a crucial role in shaping the ecosystem.
2. **Predators, not prey**: While they don't have many natural predators 
due to their size, moose are capable of defending themselves if 
threatened. They can also be fierce competitors for food resources with 
other animals, like bears and wolves.
3. **Environmental engineers**: Moose contribute significantly to shaping 
their environment through their feeding activities. By browsing on 
vegetation, they influence the growth patterns of plants, which in turn 
affect the entire ecosystem.

So, while "rule" might be a bit of an exaggeration, moose do indeed hold a 
prominent position within their ecological niche!

>>> Send a message (/? for help)
```

Here is my patch:

```
cat ~/ollama/llm/patches/04-metal.diff
diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
index 48b81313..1f386703 100644
--- a/ggml/src/ggml-metal.m
+++ b/ggml/src/ggml-metal.m
@@ -237,6 +237,19 @@ struct ggml_metal_context {
 @implementation GGMLMetalClass
 @end
 
+static id<MTLDevice> createCustomMTLDevice(void);
+
+static id<MTLDevice> createCustomMTLDevice(void) {
+    NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices();
+    for (id<MTLDevice> dev in devices) {
+        if (![dev isLowPower] && ![dev isHeadless]) {
+            return dev;
+        }
+    }
+    return nil;
+}
+
+
 static void ggml_metal_default_log_callback(enum ggml_log_level level, const char * msg, void * user_data) {
     fprintf(stderr, "%s", msg);
 
@@ -302,8 +315,8 @@ static struct ggml_metal_context * ggml_metal_init(int n_cb) {
 #endif
 
     // Pick and show default Metal device
-    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
-    GGML_METAL_LOG_INFO("%s: picking default device: %s\n", __func__, [[device name] UTF8String]);
+    id<MTLDevice> device = createCustomMTLDevice();
+    GGML_METAL_LOG_INFO("%s: picking custom device: %s\n", __func__, [[device name] UTF8String]);
 
     // Configure context
     struct ggml_metal_context * ctx = malloc(sizeof(struct ggml_metal_context));
@@ -2869,11 +2882,31 @@ static int g_backend_device_ref_count = 0;
 
 static id<MTLDevice> ggml_backend_metal_get_device(void) {
     if (g_backend_device == nil) {
-        g_backend_device = MTLCreateSystemDefaultDevice();
+        g_backend_device = createCustomMTLDevice();
+        if (g_backend_device == nil) {
+            fprintf(stderr, "Error: createCustomMTLDevice() returned nil\n");
+            
+            // Check if Metal is supported
+            if (@available(macOS 10.11, *)) {
+                fprintf(stderr, "Metal framework is available\n");
+            } else {
+                fprintf(stderr, "Metal framework is not available on this system\n");
+            }
+            
+            // List available devices
+            NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices();
+            fprintf(stderr, "Available Metal devices:\n");
+            for (id<MTLDevice> device in devices) {
+                fprintf(stderr, "  %s\n", device.name.UTF8String);
+            }
+            
+            // Additional system info
+            fprintf(stderr, "macOS Version: %s\n", [[[NSProcessInfo processInfo] operatingSystemVersionString] UTF8String]);
+        } else {
+            fprintf(stderr, "Successfully created Metal device: %s\n", g_backend_device.name.UTF8String);
+        }
     }
-
     g_backend_device_ref_count++;
-
     return g_backend_device;
 }
 
@@ -3072,6 +3105,10 @@ GGML_CALL ggml_backend_buffer_type_t ggml_backend_metal_buffer_type(void) {
 // buffer from ptr
 
 GGML_CALL ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t size, size_t max_size) {
+
+    // Right at the start of the function
+    fprintf(stderr, "DEBUG: Entering ggml_backend_metal_buffer_from_ptr with size = %zu\n", size);
+
     struct ggml_backend_metal_buffer_context * ctx = malloc(sizeof(struct ggml_backend_metal_buffer_context));
 
     ctx->all_data = data;
@@ -3095,13 +3132,24 @@ GGML_CALL ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data,
 
     id<MTLDevice> device = ggml_backend_metal_get_device();
 
-    // the buffer fits into the max buffer size allowed by the device
+    fprintf(stderr, "DEBUG: Metal device: %s, maxBufferLength: %lu\n", 
+        device.name.UTF8String, (unsigned long)device.maxBufferLength);
+
+    // Before the if statement
+    fprintf(stderr, "DEBUG: size_aligned = %zu, device.maxBufferLength = %lu\n", size_aligned, (unsigned long)device.maxBufferLength);
+
+    // Inside the if statement, before allocation
     if (size_aligned <= device.maxBufferLength) {
+
+        fprintf(stderr, "DEBUG: Attempting to allocate buffer of size %zu\n", size_aligned);
+
         ctx->buffers[ctx->n_buffers].data = data;
         ctx->buffers[ctx->n_buffers].size = size;
 
         ctx->buffers[ctx->n_buffers].metal = [device newBufferWithBytesNoCopy:data length:size_aligned options:MTLResourceStorageModeShared deallocator:nil];
 
+    fprintf(stderr, "DEBUG: ctx->buffer[ctx->n_buffers].metal= %zu \n", ctx->buffers[ctx->n_buffers].metal);
+
         if (ctx->buffers[ctx->n_buffers].metal == nil) {
             GGML_METAL_LOG_ERROR("%s: error: failed to allocate buffer, size = %8.2f MiB\n", __func__, size_aligned / 1024.0 / 1024.0);
             return false;
@@ -3221,12 +3269,34 @@ ggml_backend_t ggml_backend_metal_init(void) {
         return NULL;
     }
 
+    // Use the custom device selection method
+    id<MTLDevice> device = createCustomMTLDevice();
+    if (device == nil) {
+        GGML_METAL_LOG_ERROR("%s: error: could not create custom Metal device\n", __func__);
+        ggml_metal_free(ctx);
+        return NULL;
+    }
+
+    // Set the custom device in the context
+    ctx->device = device;
+
+    // Initialize other Metal-related resources using the custom device
+    // For example:
+    ctx->queue = [ctx->device newCommandQueue];
+    if (ctx->queue == nil) {
+        GGML_METAL_LOG_ERROR("%s: error: could not create command queue\n", __func__);
+        ggml_metal_free(ctx);
+        return NULL;
+    }
+
+    // Initialize other necessary resources...
+
     ggml_backend_t metal_backend = malloc(sizeof(struct ggml_backend));
 
     *metal_backend = (struct ggml_backend) {
-        /* .guid      = */ ggml_backend_metal_guid(),
-        /* .interface = */ ggml_backend_metal_i,
-        /* .context   = */ ctx,
+            /* .guid      = */ ggml_backend_metal_guid(),
+            /* .interface = */ ggml_backend_metal_i,
+            /* .context   = */ ctx,
     };
 
    return metal_backend;
```

Server log:

```
%  GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve

2024/08/12 09:16:10 routes.go:1108: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:metal OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-08-12T09:16:10.847-07:00 level=INFO source=images.go:781 msg="total blobs: 48"
time=2024-08-12T09:16:10.849-07:00 level=INFO source=images.go:788 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-12T09:16:10.850-07:00 level=INFO source=routes.go:1155 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-08-12T09:16:10.851-07:00 level=INFO source=payload.go:25 msg=payloadsDir payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners
time=2024-08-12T09:16:10.851-07:00 level=INFO source=payload.go:31 msg="extracting embedded files" payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners
time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:56 msg="gpuPayloadsDir: " payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners
time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal/ollama_llama_server
time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/vulkan/ollama_llama_server
time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:45 msg="Dynamic LLM libraries [metal vulkan]"
time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:46 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-08-12T09:16:10.889-07:00 level=INFO source=gpu_darwin.go:29 msg="Using Metal GPU" gpu_info="{memInfo:{TotalMemory:0 FreeMemory:0 FreeSwap:0} Library:vulkan Variant:no vector extensions MinimumMemory:0 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:0 Name: Compute: DriverMajor:0 DriverMinor:0}"
2024-08-12 09:16:10.916 ollama[86765:1415327] Debug: Recommended Max VRAM: 17163091968 bytes
time=2024-08-12T09:16:10.917-07:00 level=INFO source=gpu_darwin.go:40 msg=GpuInfo info="{memInfo:{TotalMemory:17163091968 FreeMemory:17163091968 FreeSwap:0} Library:metal Variant:no vector extensions MinimumMemory:536870912 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:0 Name: Compute: DriverMajor:0 DriverMinor:0}"
time=2024-08-12T09:16:10.917-07:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=metal compute="" driver=0.0 name="" total="16.0 GiB" available="16.0 GiB"
[GIN] 2024/08/12 - 09:16:29 | 200 |      60.451µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/12 - 09:16:29 | 200 |   15.647801ms |       127.0.0.1 | POST     "/api/show"
time=2024-08-12T09:16:29.255-07:00 level=INFO source=gpu_darwin.go:29 msg="Using Metal GPU" gpu_info="{memInfo:{TotalMemory:0 FreeMemory:0 FreeSwap:0} Library:vulkan Variant:no vector extensions MinimumMemory:0 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:0 Name: Compute: DriverMajor:0 DriverMinor:0}"
2024-08-12 09:16:29.255 ollama[86765:1415332] Debug: Recommended Max VRAM: 17163091968 bytes
time=2024-08-12T09:16:29.255-07:00 level=INFO source=gpu_darwin.go:40 msg=GpuInfo info="{memInfo:{TotalMemory:17163091968 FreeMemory:17163091968 FreeSwap:0} Library:metal Variant:no vector extensions MinimumMemory:536870912 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:0 Name: Compute: DriverMajor:0 DriverMinor:0}"
time=2024-08-12T09:16:29.275-07:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 gpu=0 parallel=4 available=17163091968 required="6.3 GiB"
2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Total Physical Memory: 137438953472 bytes
2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Page Size: 4096 bytes
2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Free Count: 7107184
2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Speculative Count: 639177
2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Inactive Count: 12183125
2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Total Free Memory: 81631174656 bytes
time=2024-08-12T09:16:29.275-07:00 level=INFO source=memory.go:309 msg="offload to metal" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[16.0 GiB]" memory.required.full="6.3 GiB" memory.required.partial="6.3 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.3 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="560.0 MiB"
time=2024-08-12T09:16:29.275-07:00 level=INFO source=payload.go:56 msg="gpuPayloadsDir: " payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners
time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal/ollama_llama_server
time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/vulkan/ollama_llama_server
time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:56 msg="gpuPayloadsDir: " payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners
time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal/ollama_llama_server
time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/vulkan/ollama_llama_server
time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:87 msg="availableServers : found" availableServers="map[metal:/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal vulkan:/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/vulkan]"
time=2024-08-12T09:16:29.276-07:00 level=INFO msg="User override" OLLAMA_LLM_LIBRARY=metal path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal
time=2024-08-12T09:16:29.277-07:00 level=INFO msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 51659"
time=2024-08-12T09:16:29.283-07:00 level=INFO msg="loaded runners" count=1
time=2024-08-12T09:16:29.283-07:00 level=INFO msg="waiting for llama runner to start responding"
time=2024-08-12T09:16:29.283-07:00 level=INFO msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3485 commit="6eeaeba1" tid="0x7ff8493c8dc0" timestamp=1723479389
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x7ff8493c8dc0" timestamp=1723479389 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="51659" tid="0x7ff8493c8dc0" timestamp=1723479389
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-08-12T09:16:29.786-07:00 level=INFO msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
DEBUG: Entering ggml_backend_metal_buffer_from_ptr with size = 4357873664
Successfully created Metal device: AMD Radeon Pro 5700 XT
DEBUG: Metal device: AMD Radeon Pro 5700 XT, maxBufferLength: 3758096384
DEBUG: size_aligned = 4357877760, device.maxBufferLength = 3758096384
ggml_backend_metal_log_allocated_size: allocated buffer, size =  3584.00 MiB, ( 3584.00 / 16368.00)

ggml_backend_metal_log_allocated_size: allocated buffer, size =   982.98 MiB, ( 4566.98 / 16368.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      Metal buffer size =  4155.99 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon Pro 5700 XT
ggml_metal_init: picking custom device: AMD Radeon Pro 5700 XT
ggml_metal_init: using embedded metal library
time=2024-08-12T09:16:30.290-07:00 level=DEBUG msg="model load progress 1.00"
time=2024-08-12T09:16:30.540-07:00 level=DEBUG msg="model load completed, waiting for server to become available" status="llm server loading model"
ggml_metal_init: GPU name:   AMD Radeon Pro 5700 XT
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = false
ggml_metal_init: hasUnifiedMemory              = false
ggml_metal_init: recommendedMaxWorkingSetSize  = 17163.09 MB
ggml_metal_init: skipping kernel_mul_mm_f32_f32                    (not supported)
ggml_metal_init: skipping kernel_mul_mm_f16_f32                    (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_0_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_1_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_0_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_1_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q8_0_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q2_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q3_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q4_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q5_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_q6_K_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32                  (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f32_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_f16_f32                 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32             (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32              (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32             (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32               (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32              (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32              (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h64            (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h80            (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h96            (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h112           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_f16_h128           (not supported)
llama_kv_cache_init:      Metal KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     2.02 MiB
llama_new_context_with_model:      Metal compute buffer size =   560.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded | tid="0x7ff8493c8dc0" timestamp=1723479394
time=2024-08-12T09:16:34.556-07:00 level=INFO msg="llama runner started in 5.27 seconds"
time=2024-08-12T09:16:34.556-07:00 level=DEBUG msg="finished setting up runner" model=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
[GIN] 2024/08/12 - 09:16:34 | 200 |  5.315834006s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-12T09:16:34.556-07:00 level=DEBUG msg="context for request finished"
time=2024-08-12T09:16:34.557-07:00 level=DEBUG msg="runner with non-zero duration has gone idle, adding timer" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 duration=5m0s
time=2024-08-12T09:16:34.557-07:00 level=DEBUG msg="after processing request finished event" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 refCount=0
time=2024-08-12T09:16:40.753-07:00 level=DEBUG msg="evaluating already loaded" model=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
time=2024-08-12T09:16:40.754-07:00 level=DEBUG msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nDo moose rule?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
[GIN] 2024/08/12 - 09:20:55 | 200 |         4m15s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-12T09:20:55.755-07:00 level=DEBUG msg="context for request finished"
time=2024-08-12T09:20:55.755-07:00 level=DEBUG msg="runner with non-zero duration has gone idle, adding timer" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 duration=5m0s
time=2024-08-12T09:20:55.755-07:00 level=DEBUG msg="after processing request finished event" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 refCount=0
time=2024-08-12T09:25:55.825-07:00 level=DEBUG msg="timer expired, expiring to unload" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
time=2024-08-12T09:25:55.826-07:00 level=DEBUG msg="runner expired event received" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
time=2024-08-12T09:25:55.826-07:00 level=DEBUG msg="got lock to unload" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
time=2024-08-12T09:25:55.826-07:00 level=DEBUG msg="stopping llama server"
time=2024-08-12T09:25:55.826-07:00 level=DEBUG msg="waiting for llama server to exit"
time=2024-08-12T09:25:55.920-07:00 level=DEBUG msg="llama server stopped"
time=2024-08-12T09:25:55.920-07:00 level=DEBUG msg="runner released" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
time=2024-08-12T09:25:55.920-07:00 level=DEBUG msg="sending an unloaded event" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
time=2024-08-12T09:25:55.920-07:00 level=DEBUG msg="ignoring unload event with no pending requests"
```

Build commands:

```
% export CLANG=/usr/bin/clang
export CC=/usr/bin/clang
export OBJC=/usr/bin/clang
export CC_FOR_BUILD=/usr/bin/clang
export OBJC_FOR_BUILD=/usr/bin/clang
export CXX=/usr/bin/clang++

% OLLAMA_SKIP_CPU_GENERATE=on OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_F16C=on -DLLAMA_FMA=on -DLLAMA_METAL=on -DLLAMA_METAL_EMBED_LIBRARY=on -DGGML_USE_METAL=on -DLLAMA_METAL_COMPILE_SERIALIZED=1 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DLLAMA_SUPPORTS_GPU_OFFLOAD=on" go generate -v ./...

(AI-Feynman) davidlaxer@bluediamond ollama % CGO_CFLAGS="-I/opt/local/include -Xclang -fopenmp" CGO_LDFLAGS="-L/opt/local/lib -framework Accelerate -L/opt/local/lib/libomp -lomp" go build .
# github.com/ollama/ollama
ld: warning: ignoring duplicate libraries: '-lomp', '-lpthread'
```

Eval:

```
./ollama run llama3.1 --verbose
>>> Please elaborate on pilot wave theory and the collapse of the Schroedinger w
... wave function.
The Pilot-Wave Theory, also known as the de Broglie-Bohm theory or Bohmian 
Mechanics, is an alternative interpretation of quantum mechanics proposed 
by Louis de Broglie and David Bohm in the 1920s and 1950s, respectively.

**The Basic Idea:**

In pilot-wave theory, particles like electrons do not follow a 
probabilistic wave function (like the Schroedinger equation) but are 
guided by a deterministic "pilot" or "hidden" wave, known as the "quantum 
potential." This pilot wave is responsible for guiding the particle's 
motion in such a way that it appears to be following the predictions of 
quantum mechanics.

**The Collapse of the Schroedinger Wave Function:**

In standard quantum mechanics, when a measurement is made on a system, the 
wave function collapses to one of its possible eigenstates. This collapse 
is a fundamental aspect of the Copenhagen interpretation, where the act of 
measurement itself causes the wave function to "jump" into one specific 
state.

However, in pilot-wave theory, there is no collapse of the wave function. 
Instead, the particle's motion is guided by the pilot wave, which 
determines its position and momentum at any given time. When a measurement 
is made, the pilot wave simply reflects the new information obtained from 
the measurement, without changing the underlying wave function.

**Key Features:**

1. **No Wave Function Collapse:** In pilot-wave theory, the wave function 
remains intact, and there is no collapse of the state upon measurement.
2. **Deterministic Motion:** The particle's motion is deterministic, 
guided by the pilot wave.
3. **Hidden Variables:** Pilot-wave theory requires the introduction of 
hidden variables (the pilot wave) to explain the behavior of particles at 
a microscopic level.

**Challenges and Criticisms:**

Pilot-wave theory has faced several criticisms and challenges:

1. **Lack of Predictive Power:** The theory relies on an additional, 
unmeasurable variable (the pilot wave), which makes it difficult to make 
precise predictions.
2. **Quantum Interpretation:** Pilot-wave theory is often seen as a 
non-standard interpretation of quantum mechanics, differing from the 
Copenhagen interpretation and other popular approaches.

**Advantages:**

1. **Solves the Measurement Problem:** Pilot-wave theory provides an 
alternative explanation for the measurement process, sidestepping the 
difficulties associated with wave function collapse.
2. **Classical-Like Behavior:** The deterministic motion in pilot-wave 
theory can exhibit classical-like behavior, which is appealing from a 
philosophical and intuitive perspective.

In conclusion, pilot-wave theory offers an intriguing alternative to 
standard quantum mechanics, where particles are guided by a deterministic 
"pilot" wave rather than probabilistic wave functions. While it faces 
challenges and criticisms, the theory remains an active area of research 
and debate in the foundations of quantum mechanics.

total duration:       10m27.349376696s
load duration:        16.784202ms
prompt eval count:    28 token(s)
prompt eval duration: 26.016955s
prompt eval rate:     1.08 tokens/s
eval count:           573 token(s)
eval duration:        10m1.304367s
eval rate:            0.95 tokens/s
```

<!-- gh-comment-id:2284454736 --> @dbl001 commented on GitHub (Aug 12, 2024): I got ollama to run llama 3.1 on my iMac 27" utilizing an AMD Radeon Pro 5700 XT. <img width="418" alt="Screenshot 2024-08-12 at 9 18 43 AM" src="https://github.com/user-attachments/assets/23dde166-bff6-4660-b82f-537e5d696a05"> I had to modify ggm-metal.m to circumvent an problem where id<MTLDevice> device = MTLCreateSystemDefaultDevice(); always returned 'nil'. I still don't know why. Running llama 3.1: ``` A clever play on words! In the sense that they are apex predators in their native range and can be quite powerful, yes, moose do "rule" in their own domain. Here are a few reasons why: 1. **King of the Forest**: Moose (Alces alces) are the largest members of the deer family (Cervidae) in North America, with males weighing up to 1,500 pounds (680 kg). They're well-adapted to their environment and play a crucial role in shaping the ecosystem. 2. **Predators, not prey**: While they don't have many natural predators due to their size, moose are capable of defending themselves if threatened. They can also be fierce competitors for food resources with other animals, like bears and wolves. 3. **Environmental engineers**: Moose contribute significantly to shaping their environment through their feeding activities. By browsing on vegetation, they influence the growth patterns of plants, which in turn affect the entire ecosystem. So, while "rule" might be a bit of an exaggeration, moose do indeed hold a prominent position within their ecological niche! >>> Send a message (/? for help) ``` Here is my patch: ``` cat ~/ollama/llm/patches/04-metal.diff diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m index 48b81313..1f386703 100644 --- a/ggml/src/ggml-metal.m +++ b/ggml/src/ggml-metal.m @@ -237,6 +237,19 @@ struct ggml_metal_context { @implementation GGMLMetalClass @end +static id<MTLDevice> createCustomMTLDevice(void); + +static id<MTLDevice> createCustomMTLDevice(void) { + NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices(); + for (id<MTLDevice> dev in devices) { + if (![dev isLowPower] && ![dev isHeadless]) { + return dev; + } + } + return nil; +} + + static void ggml_metal_default_log_callback(enum ggml_log_level level, const char * msg, void * user_data) { fprintf(stderr, "%s", msg); @@ -302,8 +315,8 @@ static struct ggml_metal_context * ggml_metal_init(int n_cb) { #endif // Pick and show default Metal device - id<MTLDevice> device = MTLCreateSystemDefaultDevice(); - GGML_METAL_LOG_INFO("%s: picking default device: %s\n", __func__, [[device name] UTF8String]); + id<MTLDevice> device = createCustomMTLDevice(); + GGML_METAL_LOG_INFO("%s: picking cutom device: %s\n", __func__, [[device name] UTF8String]); // Configure context struct ggml_metal_context * ctx = malloc(sizeof(struct ggml_metal_context)); @@ -2869,11 +2882,31 @@ static int g_backend_device_ref_count = 0; static id<MTLDevice> ggml_backend_metal_get_device(void) { if (g_backend_device == nil) { - g_backend_device = MTLCreateSystemDefaultDevice(); + g_backend_device = createCustomMTLDevice(); + if (g_backend_device == nil) { + fprintf(stderr, "Error: createCustomMTLDevice() returned nil\n"); + + // Check if Metal is supported + if (@available(macOS 10.11, *)) { + fprintf(stderr, "Metal framework is available\n"); + } else { + fprintf(stderr, "Metal framework is not available on this system\n"); + } + + // List available devices + NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices(); + fprintf(stderr, "Available Metal devices:\n"); + for 
(id<MTLDevice> device in devices) { + fprintf(stderr, " %s\n", device.name.UTF8String); + } + + // Additional system info + fprintf(stderr, "macOS Version: %s\n", [[[NSProcessInfo processInfo] operatingSystemVersionString] UTF8String]); + } else { + fprintf(stderr, "Successfully created Metal device: %s\n", g_backend_device.name.UTF8String); + } } - g_backend_device_ref_count++; - return g_backend_device; } @@ -3072,6 +3105,10 @@ GGML_CALL ggml_backend_buffer_type_t ggml_backend_metal_buffer_type(void) { // buffer from ptr GGML_CALL ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t size, size_t max_size) { + + // Right at the start of the function + fprintf(stderr, "DEBUG: Entering ggml_backend_metal_buffer_from_ptr with size = %zu\n", size); + struct ggml_backend_metal_buffer_context * ctx = malloc(sizeof(struct ggml_backend_metal_buffer_context)); ctx->all_data = data; @@ -3095,13 +3132,24 @@ GGML_CALL ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, id<MTLDevice> device = ggml_backend_metal_get_device(); - // the buffer fits into the max buffer size allowed by the device + fprintf(stderr, "DEBUG: Metal device: %s, maxBufferLength: %lu\n", + device.name.UTF8String, (unsigned long)device.maxBufferLength); + + // Before the if statement + fprintf(stderr, "DEBUG: size_aligned = %zu, device.maxBufferLength = %lu\n", size_aligned, (unsigned long)device.maxBufferLength); + + // Inside the if statement, before allocation if (size_aligned <= device.maxBufferLength) { + + fprintf(stderr, "DEBUG: Attempting to allocate buffer of size %zu\n", size_aligned); + ctx->buffers[ctx->n_buffers].data = data; ctx->buffers[ctx->n_buffers].size = size; ctx->buffers[ctx->n_buffers].metal = [device newBufferWithBytesNoCopy:data length:size_aligned options:MTLResourceStorageModeShared deallocator:nil]; + fprintf(stderr, "DEBUG: ctx->buffer[ctx->n_buffers].metal= %zu \n", ctx->buffers[ctx->n_buffers].metal); + if (ctx->buffers[ctx->n_buffers].metal == nil) { GGML_METAL_LOG_ERROR("%s: error: failed to allocate buffer, size = %8.2f MiB\n", __func__, size_aligned / 1024.0 / 1024.0); return false; @@ -3221,12 +3269,34 @@ ggml_backend_t ggml_backend_metal_init(void) { return NULL; } + // Use the custom device selection method + id<MTLDevice> device = createCustomMTLDevice(); + if (device == nil) { + GGML_METAL_LOG_ERROR("%s: error: could not create custom Metal device\n", __func__); + ggml_metal_free(ctx); + return NULL; + } + + // Set the custom device in the context + ctx->device = device; + + // Initialize other Metal-related resources using the custom device + // For example: + ctx->queue = [ctx->device newCommandQueue]; + if (ctx->queue == nil) { + GGML_METAL_LOG_ERROR("%s: error: could not create command queue\n", __func__); + ggml_metal_free(ctx); + return NULL; + } + + // Initialize other necessary resources... 
+ ggml_backend_t metal_backend = malloc(sizeof(struct ggml_backend)); *metal_backend = (struct ggml_backend) { - /* .guid = */ ggml_backend_metal_guid(), - /* .interface = */ ggml_backend_metal_i, - /* .context = */ ctx, + /* .guid = */ ggml_backend_metal_guid(), + /* .interface = */ ggml_backend_metal_i, + /* .context = */ ctx, }; return metal_backend; ``` Server log: ``` % GIN_MODE=debug OLLAMA_LLM_LIBRARY=metal ./ollama serve 2024/08/12 09:16:10 routes.go:1108: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY:metal OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/davidlaxer/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]" time=2024-08-12T09:16:10.847-07:00 level=INFO source=images.go:781 msg="total blobs: 48" time=2024-08-12T09:16:10.849-07:00 level=INFO source=images.go:788 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached. [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production. - using env: export GIN_MODE=release - using code: gin.SetMode(gin.ReleaseMode) [GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers) [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers) [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers) [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers) [GIN-debug] GET / --> 
github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2024-08-12T09:16:10.850-07:00 level=INFO source=routes.go:1155 msg="Listening on 127.0.0.1:11434 (version 0.0.0)" time=2024-08-12T09:16:10.851-07:00 level=INFO source=payload.go:25 msg=payloadsDir payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners time=2024-08-12T09:16:10.851-07:00 level=INFO source=payload.go:31 msg="extracting embedded files" payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:56 msg="gpuPayloadsDir: " payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal/ollama_llama_server time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/vulkan/ollama_llama_server time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:45 msg="Dynamic LLM libraries [metal vulkan]" time=2024-08-12T09:16:10.889-07:00 level=INFO source=payload.go:46 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY" time=2024-08-12T09:16:10.889-07:00 level=INFO source=gpu_darwin.go:29 msg="Using Metal GPU" gpu_info="{memInfo:{TotalMemory:0 FreeMemory:0 FreeSwap:0} Library:vulkan Variant:no vector extensions MinimumMemory:0 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:0 Name: Compute: DriverMajor:0 DriverMinor:0}" 2024-08-12 09:16:10.916 ollama[86765:1415327] Debug: Recommended Max VRAM: 17163091968 bytes time=2024-08-12T09:16:10.917-07:00 level=INFO source=gpu_darwin.go:40 msg=GpuInfo info="{memInfo:{TotalMemory:17163091968 FreeMemory:17163091968 FreeSwap:0} Library:metal Variant:no vector extensions MinimumMemory:536870912 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:0 Name: Compute: DriverMajor:0 DriverMinor:0}" time=2024-08-12T09:16:10.917-07:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=metal compute="" driver=0.0 name="" total="16.0 GiB" available="16.0 GiB" [GIN] 2024/08/12 - 09:16:29 | 200 | 60.451µs | 127.0.0.1 | HEAD "/" [GIN] 2024/08/12 - 09:16:29 | 200 | 15.647801ms | 127.0.0.1 | POST "/api/show" time=2024-08-12T09:16:29.255-07:00 level=INFO source=gpu_darwin.go:29 msg="Using Metal GPU" gpu_info="{memInfo:{TotalMemory:0 FreeMemory:0 FreeSwap:0} Library:vulkan Variant:no vector extensions MinimumMemory:0 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:0 Name: Compute: DriverMajor:0 DriverMinor:0}" 2024-08-12 09:16:29.255 ollama[86765:1415332] Debug: Recommended Max VRAM: 17163091968 bytes time=2024-08-12T09:16:29.255-07:00 level=INFO source=gpu_darwin.go:40 msg=GpuInfo info="{memInfo:{TotalMemory:17163091968 FreeMemory:17163091968 FreeSwap:0} 
Library:metal Variant:no vector extensions MinimumMemory:536870912 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:0 Name: Compute: DriverMajor:0 DriverMinor:0}" time=2024-08-12T09:16:29.275-07:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 gpu=0 parallel=4 available=17163091968 required="6.3 GiB" 2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Total Physical Memory: 137438953472 bytes 2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Page Size: 4096 bytes 2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Free Count: 7107184 2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Speculative Count: 639177 2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Inactive Count: 12183125 2024-08-12 09:16:29.275 ollama[86765:1415334] Debug: Total Free Memory: 81631174656 bytes time=2024-08-12T09:16:29.275-07:00 level=INFO source=memory.go:309 msg="offload to metal" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[16.0 GiB]" memory.required.full="6.3 GiB" memory.required.partial="6.3 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.3 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="560.0 MiB" time=2024-08-12T09:16:29.275-07:00 level=INFO source=payload.go:56 msg="gpuPayloadsDir: " payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal/ollama_llama_server time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/vulkan/ollama_llama_server time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:56 msg="gpuPayloadsDir: " payloadsDir=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal/ollama_llama_server time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:74 msg="Available servers found" file=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/vulkan/ollama_llama_server time=2024-08-12T09:16:29.276-07:00 level=INFO source=payload.go:87 msg="availableServers : found" availableServers="map[metal:/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal vulkan:/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/vulkan]" time=2024-08-12T09:16:29.276-07:00 level=INFO msg="User override" OLLAMA_LLM_LIBRARY=metal path=/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal time=2024-08-12T09:16:29.277-07:00 level=INFO msg="starting llama server" cmd="/var/folders/3n/56fpv14n4wj0c1l1sb106pzw0000gn/T/ollama1380849569/runners/metal/ollama_llama_server --model /Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 51659" time=2024-08-12T09:16:29.283-07:00 level=INFO 
msg="loaded runners" count=1 time=2024-08-12T09:16:29.283-07:00 level=INFO msg="waiting for llama runner to start responding" time=2024-08-12T09:16:29.283-07:00 level=INFO msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=3485 commit="6eeaeba1" tid="0x7ff8493c8dc0" timestamp=1723479389 INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x7ff8493c8dc0" timestamp=1723479389 total_threads=16 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="51659" tid="0x7ff8493c8dc0" timestamp=1723479389 llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1 llama_model_loader: - kv 5: general.size_label str = 8B llama_model_loader: - kv 6: general.license str = llama3.1 llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam... llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ... llama_model_loader: - kv 9: llama.block_count u32 = 32 llama_model_loader: - kv 10: llama.context_length u32 = 131072 llama_model_loader: - kv 11: llama.embedding_length u32 = 4096 llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 13: llama.attention.head_count u32 = 32 llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 17: general.file_type u32 = 2 llama_model_loader: - kv 18: llama.vocab_size u32 = 128256 llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... 
llama_model_loader: - kv 28: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-08-12T09:16:29.786-07:00 level=INFO msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.7999 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 llm_load_tensors: ggml ctx size = 0.27 MiB DEBUG: Entering ggml_backend_metal_buffer_from_ptr with size = 4357873664 Successfully created Metal device: AMD Radeon Pro 5700 XT DEBUG: Metal device: AMD Radeon Pro 5700 XT, maxBufferLength: 3758096384 DEBUG: size_aligned = 4357877760, device.maxBufferLength = 3758096384 ggml_backend_metal_log_allocated_size: allocated buffer, size = 3584.00 MiB, ( 3584.00 / 16368.00) ggml_backend_metal_log_allocated_size: allocated buffer, size = 982.98 MiB, ( 4566.98 / 16368.00) llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 281.81 MiB llm_load_tensors: Metal buffer size = 4155.99 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 ggml_metal_init: allocating ggml_metal_init: found device: AMD Radeon Pro 5700 XT ggml_metal_init: picking cutom 
device: AMD Radeon Pro 5700 XT ggml_metal_init: using embedded metal library time=2024-08-12T09:16:30.290-07:00 level=DEBUG msg="model load progress 1.00" time=2024-08-12T09:16:30.540-07:00 level=DEBUG msg="model load completed, waiting for server to become available" status="llm server loading model" ggml_metal_init: GPU name: AMD Radeon Pro 5700 XT ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001) ggml_metal_init: simdgroup reduction support = true ggml_metal_init: simdgroup matrix mul. support = false ggml_metal_init: hasUnifiedMemory = false ggml_metal_init: recommendedMaxWorkingSetSize = 17163.09 MB ggml_metal_init: skipping kernel_mul_mm_f32_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_f16_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q4_0_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q4_1_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q5_0_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q5_1_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q8_0_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q2_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q3_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q4_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q5_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_q6_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_iq2_xxs_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_iq2_xs_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_iq3_xxs_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_iq3_s_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_iq2_s_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_iq1_s_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_iq1_m_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_iq4_nl_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_iq4_xs_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_f32_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_f16_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q4_0_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q4_1_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q5_0_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q5_1_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q8_0_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q2_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q3_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q4_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q5_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_q6_K_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_iq2_xxs_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_iq2_xs_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_iq3_xxs_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_iq3_s_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_iq2_s_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_iq1_s_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_iq1_m_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_iq4_nl_f32 (not supported) ggml_metal_init: skipping kernel_mul_mm_id_iq4_xs_f32 (not supported) ggml_metal_init: skipping kernel_flash_attn_ext_f16_h64 (not 
supported) ggml_metal_init: skipping kernel_flash_attn_ext_f16_h80 (not supported) ggml_metal_init: skipping kernel_flash_attn_ext_f16_h96 (not supported) ggml_metal_init: skipping kernel_flash_attn_ext_f16_h112 (not supported) ggml_metal_init: skipping kernel_flash_attn_ext_f16_h128 (not supported) llama_kv_cache_init: Metal KV buffer size = 1024.00 MiB llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama_new_context_with_model: CPU output buffer size = 2.02 MiB llama_new_context_with_model: Metal compute buffer size = 560.00 MiB llama_new_context_with_model: CPU compute buffer size = 24.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 INFO [main] model loaded | tid="0x7ff8493c8dc0" timestamp=1723479394 time=2024-08-12T09:16:34.556-07:00 level=INFO msg="llama runner started in 5.27 seconds" time=2024-08-12T09:16:34.556-07:00 level=DEBUG msg="finished setting up runner" model=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 [GIN] 2024/08/12 - 09:16:34 | 200 | 5.315834006s | 127.0.0.1 | POST "/api/chat" time=2024-08-12T09:16:34.556-07:00 level=DEBUG msg="context for request finished" time=2024-08-12T09:16:34.557-07:00 level=DEBUG msg="runner with non-zero duration has gone idle, adding timer" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 duration=5m0s time=2024-08-12T09:16:34.557-07:00 level=DEBUG msg="after processing request finished event" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 refCount=0 time=2024-08-12T09:16:40.753-07:00 level=DEBUG msg="evaluating already loaded" model=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 time=2024-08-12T09:16:40.754-07:00 level=DEBUG msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nDo moose rule?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" [GIN] 2024/08/12 - 09:20:55 | 200 | 4m15s | 127.0.0.1 | POST "/api/chat" time=2024-08-12T09:20:55.755-07:00 level=DEBUG msg="context for request finished" time=2024-08-12T09:20:55.755-07:00 level=DEBUG msg="runner with non-zero duration has gone idle, adding timer" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 duration=5m0s time=2024-08-12T09:20:55.755-07:00 level=DEBUG msg="after processing request finished event" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 refCount=0 time=2024-08-12T09:25:55.825-07:00 level=DEBUG msg="timer expired, expiring to unload" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 time=2024-08-12T09:25:55.826-07:00 level=DEBUG msg="runner expired event received" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 time=2024-08-12T09:25:55.826-07:00 level=DEBUG msg="got lock to unload" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 time=2024-08-12T09:25:55.826-07:00 level=DEBUG msg="stopping llama server" time=2024-08-12T09:25:55.826-07:00 level=DEBUG msg="waiting for llama server to exit" 
time=2024-08-12T09:25:55.920-07:00 level=DEBUG msg="llama server stopped" time=2024-08-12T09:25:55.920-07:00 level=DEBUG msg="runner released" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 time=2024-08-12T09:25:55.920-07:00 level=DEBUG msg="sending an unloaded event" modelPath=/Users/davidlaxer/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 time=2024-08-12T09:25:55.920-07:00 level=DEBUG msg="ignoring unload event with no pending requests" ``` Build commands: ``` % export CLANG=/usr/bin/clang export CC=/usr/bin/clang export OBJC=/usr/bin/clang export CC_FOR_BUILD=/usr/bin/clang export OBJC_FOR_BUILD=/usr/bin/clang export CXX=/usr/bin/clang++ % OLLAMA_SKIP_CPU_GENERATE=on OLLAMA_CUSTOM_CPU_DEFS="-DLLAMA_AVX=on -DLLAMA_AVX2=on -DLLAMA_F16C=on -DLLAMA_FMA=on -DLLAMA_METAL=on -DLLAMA_METAL_EMBED_LIBRARY=on -DGGML_USE_METAL=on -DLLAMA_METAL_COMPILE_SERIALIZED=1 -DLLAMA_METAL_MACOSX_VERSION_MIN=11.3 -DLLAMA_SUPPORTS_GPU_OFFLOAD=on" go generate -v ./... (AI-Feynman) davidlaxer@bluediamond ollama % CGO_CFLAGS="-I/opt/local/include -Xclang -fopenmp" CGO_LDFLAGS="-L/opt/local/lib -framework Accelerate -L/opt/local/lib/libomp -lomp" go build . # github.com/ollama/ollama ld: warning: ignoring duplicate libraries: '-lomp', '-lpthread' ``` Eval: ``` ./ollama run llama3.1 --verbose >>> Please elaborate on pilot wave theory and the collapse of the Schroedinger w ... wave function. The Pilot-Wave Theory, also known as the de Broglie-Bohm theory or Bohmian Mechanics, is an alternative interpretation of quantum mechanics proposed by Louis de Broglie and David Bohm in the 1920s and 1950s, respectively. **The Basic Idea:** In pilot-wave theory, particles like electrons do not follow a probabilistic wave function (like the Schroedinger equation) but are guided by a deterministic "pilot" or "hidden" wave, known as the "quantum potential." This pilot wave is responsible for guiding the particle's motion in such a way that it appears to be following the predictions of quantum mechanics. **The Collapse of the Schroedinger Wave Function:** In standard quantum mechanics, when a measurement is made on a system, the wave function collapses to one of its possible eigenstates. This collapse is a fundamental aspect of the Copenhagen interpretation, where the act of measurement itself causes the wave function to "jump" into one specific state. However, in pilot-wave theory, there is no collapse of the wave function. Instead, the particle's motion is guided by the pilot wave, which determines its position and momentum at any given time. When a measurement is made, the pilot wave simply reflects the new information obtained from the measurement, without changing the underlying wave function. **Key Features:** 1. **No Wave Function Collapse:** In pilot-wave theory, the wave function remains intact, and there is no collapse of the state upon measurement. 2. **Deterministic Motion:** The particle's motion is deterministic, guided by the pilot wave. 3. **Hidden Variables:** Pilot-wave theory requires the introduction of hidden variables (the pilot wave) to explain the behavior of particles at a microscopic level. **Challenges and Criticisms:** Pilot-wave theory has faced several criticisms and challenges: 1. **Lack of Predictive Power:** The theory relies on an additional, unmeasurable variable (the pilot wave), which makes it difficult to make precise predictions. 2. 
**Quantum Interpretation:** Pilot-wave theory is often seen as a non-standard interpretation of quantum mechanics, differing from the Copenhagen interpretation and other popular approaches. **Advantages:** 1. **Solves the Measurement Problem:** Pilot-wave theory provides an alternative explanation for the measurement process, sidestepping the difficulties associated with wave function collapse. 2. **Classical-Like Behavior:** The deterministic motion in pilot-wave theory can exhibit classical-like behavior, which is appealing from a philosophical and intuitive perspective. In conclusion, pilot-wave theory offers an intriguing alternative to standard quantum mechanics, where particles are guided by a deterministic "pilot" wave rather than probabilistic wave functions. While it faces challenges and criticisms, the theory remains an active area of research and debate in the foundations of quantum mechanics. total duration: 10m27.349376696s load duration: 16.784202ms prompt eval count: 28 token(s) prompt eval duration: 26.016955s prompt eval rate: 1.08 tokens/s eval count: 573 token(s) eval duration: 10m1.304367s eval rate: 0.95 tokens/s ```
Author
Owner

@raparici commented on GitHub (Sep 3, 2024):

Is this fix going to main ?

<!-- gh-comment-id:2326264719 --> @raparici commented on GitHub (Sep 3, 2024): Is this fix going to main ?
Author
Owner

@cracksauce commented on GitHub (Sep 17, 2024):

Commenting in support of this feature request to optimize the Ollama experience for users with AMD GPUs, particularly those using eGPUs on Intel Macs, who have historically been unable to use their graphics hardware for acceleration.

<!-- gh-comment-id:2356071739 --> @cracksauce commented on GitHub (Sep 17, 2024): Commenting in support of this feature request and optimize the Ollama experience for users with AMD GPUs, particularly those using eGPUs on Intel Macs, who've historically been unable to utilize their graphics hardware for acceleration.
Author
Owner

@THL-Leo commented on GitHub (Sep 17, 2024):

@Grergo What flags did you use to generate and build with Vulkan? I followed @ahornby's guide and received empty spaces as output. I am running on an i9 Intel MBP with a 5500M GPU. I was able to run Metal but not Vulkan.

<!-- gh-comment-id:2356647385 --> @THL-Leo commented on GitHub (Sep 17, 2024): @Grergo What flags did you use to generate and build with Vulkan? I followed @ahornby's guide and received empty spaces as output. I am running on i9 intel mbp with 5500m GPU. I was able to run metal but not vulkan.
Author
Owner

@dbl001 commented on GitHub (Sep 17, 2024):

Same here. Vulkan gives gibberish. Metal works better, but I get GPU errors after the first or second prompt.
iMac 27” 2021 w/AMD Radeon Pro 5700 XT.

<!-- gh-comment-id:2356738022 --> @dbl001 commented on GitHub (Sep 17, 2024): Same here. Vulkan gives gibberish. Metal works better, but I get GPU errors after the first or second prompt. iMac 27” 2021 w/AMD Radeon Pro 5700 XT. > On Sep 17, 2024, at 11:44 AM, Leo Lee ***@***.***> wrote: > > > @Grergo <https://github.com/Grergo> What flags did you use to generate and build with Vulkan? I followed @ahornby <https://github.com/ahornby>'s guide and received empty spaces as output. I am running on i9 intel mbp with 5500m GPU. I was able to run metal but not vulkan. > > — > Reply to this email directly, view it on GitHub <https://github.com/ollama/ollama/issues/1016#issuecomment-2356647385>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAXWFW5BC2NSCEIRL643N2TZXBZ77AVCNFSM6AAAAAA67V3742VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJWGY2DOMZYGU>. > You are receiving this because you were mentioned. >
Author
Owner

@Grergo commented on GitHub (Sep 18, 2024):

@Grergo, what flags did you use to generate and build with Vulkan? I followed @ahornby's guide and received blank output. I am running on an i9 Intel MBP with a 5500M GPU. I can run Metal but not Vulkan.

Sorry, I haven't followed up on this issue for a while. I used the default parameters to build from this commit: 5709e5. The GPU is RX6600XT, hope this helps you.

<!-- gh-comment-id:2357239411 --> @Grergo commented on GitHub (Sep 18, 2024): > @Grergo你使用哪些标志来生成和构建 Vulkan?我遵循@ahornby指南并收到空白作为输出。我在配备 5500m GPU 的 i9 intel mbp 上运行。我可以运行 metal,但不能运行 vulkan。 Sorry, I haven't followed up on this issue for a while. I used the default parameters to build from this commit: [5709e5](https://github.com/ollama/ollama/commit/5709e59e10808b3621c35910bd5df948ed6a740e). The GPU is RX6600XT, hope this helps you.
Author
Owner

@THL-Leo commented on GitHub (Sep 18, 2024):

@dbl001 Mine too; my only guess is that our GPUs don't support a certain version of Vulkan/Metal. Even when I ran with Metal, my performance was much worse than what ahornby reported: he gets 4 tokens per second on an older MacBook model, while I am getting 1.5-2 tokens per second with the 5500M.

@Grergo I see. Thank you for your input. When I ran with the default parameters from that commit, I hit an issue where the static build target doesn't exist. I removed the static flag and it built, but Vulkan still returns gibberish.

I am not too sure what to mess with in the files since I don't have much experience with ollama or llama.cpp.

<!-- gh-comment-id:2357277669 --> @THL-Leo commented on GitHub (Sep 18, 2024): @dbl001 Mine too, my only guess is that maybe our GPUs don't support certain version of vulkan/metal. Even when I ran with metal my performance is way worse than the one claimed by ahornby. He is running 4 tokens per second with an older model of macbook while I am getting 1.5-2 tokens per second using 5500m. @Grergo I see. Thank you for your input, when I ran with the default parameters from the commit I had an issue where static build doesn't exist. I removed the static flag and it ran. But Vulkan still returns gibberish. I am not too sure what to mess with in the files since I don't have much experience with ollama or llama.cpp.
Author
Owner

@dbl001 commented on GitHub (Sep 18, 2024):

llama3 on Metal:

total duration:       4m0.331820271s
load duration:        21.121239ms
prompt eval count:    235 token(s)
prompt eval duration: 3m29.026698s
prompt eval rate:     1.12 tokens/s
eval count:           35 token(s)
eval duration:        31.278045s
eval rate:            1.12 tokens/s

Sometimes I get these messages about a GPU issue:

error: Ignored (for causing prior/excessive GPU errors) (00000004:kIOAccelCommandBufferCallbackErrorSubmissionsIgnored)
<!-- gh-comment-id:2358470087 --> @dbl001 commented on GitHub (Sep 18, 2024): llama3 on Metal: ``` total duration: 4m0.331820271s load duration: 21.121239ms prompt eval count: 235 token(s) prompt eval duration: 3m29.026698s prompt eval rate: 1.12 tokens/s eval count: 35 token(s) eval duration: 31.278045s eval rate: 1.12 tokens/s ``` Some times I get these messages about a GPU issue: ``` error: Ignored (for causing prior/excessive GPU errors) (00000004:kIOAccelCommandBufferCallbackErrorSubmissionsIgnored) ```
Author
Owner

@THL-Leo commented on GitHub (Sep 19, 2024):

I get similar results on phi3. Perhaps our GPUs are just too weak/outdated for this.

<!-- gh-comment-id:2360399905 --> @THL-Leo commented on GitHub (Sep 19, 2024): I get similar results on phi3. Perhaps our GPUs are just too weak/outdated for this.
Author
Owner

@duolabmeng6 commented on GitHub (Sep 21, 2024):

How to support AMD Radeon Pro 5500?

<!-- gh-comment-id:2365075697 --> @duolabmeng6 commented on GitHub (Sep 21, 2024): How to support AMD Radeon Pro 5500?
Author
Owner

@SecuritySura commented on GitHub (Oct 26, 2024):

I can see this thread is more than a year old. I also have a Mac with an AMD Radeon Pro 5500M (8 GB) GPU. Does Ollama still not support this GPU, or has anyone found a solution? I would appreciate your kind response.

<!-- gh-comment-id:2439187437 --> @SecuritySura commented on GitHub (Oct 26, 2024): I can see this thread is older than a year. I also have Mac with AMD Radeon Pro 5500M (8GB) GPU. still Ollama not support this GPU? or anyone found a solution? appreciate your kind response.
Author
Owner

@TomDev234 commented on GitHub (Oct 31, 2024):

I got ollama to run llama 3.1 on my iMac 27" utilizing an AMD Radeon Pro 5700 XT.

Can you upload your binaries somewhere?
I have built ollama according to your instructions with the current source code, but the executable does not find any runners.

<!-- gh-comment-id:2450058860 --> @TomDev234 commented on GitHub (Oct 31, 2024): > I got ollama to run llama 3.1 on my iMac 27" utilizing an AMD Radeon Pro 5700 XT. Can you upload your binaries somewhere? I have build ollama according to your instructions with the current source code. But the executable does not find any runners.
Author
Owner

@aes512 commented on GitHub (Nov 15, 2024):

I got ollama to run llama 3.1 on my iMac 27" utilizing an AMD Radeon Pro 5700 XT.

Can you upload your binaries somewhere? I have built ollama according to your instructions with the current source code, but the executable does not find any runners.

Same here.

<!-- gh-comment-id:2478906269 --> @aes512 commented on GitHub (Nov 15, 2024): > > I got ollama to run llama 3.1 on my iMac 27" utilizing an AMD Radeon Pro 5700 XT. > > Can you upload your binaries somewhere? I have build ollama according to your instructions with the current source code. But the executable does not find any runners. Same here.
Author
Owner

@21307369 commented on GitHub (Nov 19, 2024):

My graphics card is a 6750 GRE 12G. How can I make this card usable for ollama under macOS?

<!-- gh-comment-id:2486703282 --> @21307369 commented on GitHub (Nov 19, 2024): My graphics is 6750Gre 12G. How can I make this graphics usable for ollam under macOS?
Author
Owner

@TomDev234 commented on GitHub (Nov 19, 2024):

My graphics card is a 6750 GRE 12G. How can I make this card usable for ollama under macOS?

Fix ollama's metal support for Intel Macs.

<!-- gh-comment-id:2486709021 --> @TomDev234 commented on GitHub (Nov 19, 2024): > My graphics is 6750Gre 12G. How can I make this graphics usable for ollam under macOS? Fix ollama's metal support for Intel Macs.
Author
Owner

@alsyundawy commented on GitHub (Dec 17, 2024):

Any update? Because on my MacBook 2018 the RX 560 is not working; it always uses the CPU, not the GPU.

ollama version is 0.5.3

Radeon Pro 560X:

  Chipset Model: Radeon Pro 560X
  Type: GPU
  Bus: PCIe
  PCIe Lane Width: x8
  VRAM (Total): 4 GB
  Vendor: AMD (0x1002)
  Device ID: 0x67ef
  Revision ID: 0x00c2
  ROM Revision: 113-C980AL-075
  VBIOS Version: 113-C97501U-005
  EFI Driver Version: 01.A1.075
  Automatic Graphics Switching: Supported
  gMux Version: 5.0.0
  Metal Family: Supported, Metal GPUFamily macOS 2
<!-- gh-comment-id:2549624636 --> @alsyundawy commented on GitHub (Dec 17, 2024): any update? becouse my mb 2018 rx 560 not working always cpu not gpu ollama version is 0.5.3 Radeon Pro 560X: Chipset Model: Radeon Pro 560X Type: GPU Bus: PCIe PCIe Lane Width: x8 VRAM (Total): 4 GB Vendor: AMD (0x1002) Device ID: 0x67ef Revision ID: 0x00c2 ROM Revision: 113-C980AL-075 VBIOS Version: 113-C97501U-005 EFI Driver Version: 01.A1.075 Automatic Graphics Switching: Supported gMux Version: 5.0.0 Metal Family: Supported, Metal GPUFamily macOS 2
Author
Owner

@soerenkampschroer commented on GitHub (Dec 22, 2024):

Llama.cpp now supports my GPU in both Metal and Vulkan (RX 6800). Unfortunately, Metal is still about half as fast as the CPU. Vulkan on the other hand is extremely fast for me. I did a small benchmark with gemma-2-9b-it.Q5_K_M.gguf:

Metal 3.6 t/s
CPU 5.7 t/s
Vulkan 41.3 t/s

The problem is that Vulkan is not stable at all and will descend into gibberish half the time. The longer the prompt, the higher the chance for it to go wrong.

<!-- gh-comment-id:2558421274 --> @soerenkampschroer commented on GitHub (Dec 22, 2024): Llama.cpp now supports my GPU in both Metal and Vulkan (RX 6800). Unfortunately, Metal is still about half as fast as the CPU. Vulkan on the other hand is extremely fast for me. I did a small benchmark with gemma-2-9b-it.Q5_K_M.gguf: Metal 3.6 t/s CPU 5.7 t/s Vulkan 41.3 t/s The problem is that Vulkan is not stable at all and will descend into gibberish half the time. The longer the prompt, the higher the chance for it to go wrong.
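For anyone who wants to reproduce a backend comparison like the one above, llama.cpp ships a llama-bench tool. A minimal sketch, assuming a model at a placeholder path and a separate llama.cpp build per backend (Metal, CPU-only, Vulkan):

```
# CPU baseline: offload no layers to the GPU
./build/bin/llama-bench -m ~/models/gemma-2-9b-it.Q5_K_M.gguf -ngl 0

# GPU run: offload all layers (uses Metal or Vulkan depending on how this binary was built)
./build/bin/llama-bench -m ~/models/gemma-2-9b-it.Q5_K_M.gguf -ngl 99
```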
Author
Owner

@MarcelHeemskerk commented on GitHub (Jan 6, 2025):

Llama.cpp now supports my GPU in both Metal and Vulkan (RX 6800). Unfortunately, Metal is still about half as fast as the CPU. Vulkan on the other hand is extremely fast for me.

Is that using MoltenVK, @soerenkampschroer? I am not aware of native support for Vulkan on macOS.

<!-- gh-comment-id:2572944235 --> @MarcelHeemskerk commented on GitHub (Jan 6, 2025): > Llama.cpp now supports my GPU in both Metal and Vulkan (RX 6800). Unfortunately, Metal is still about half as fast as the CPU. Vulkan on the other hand is extremely fast for me. Is that using MoltenVK @soerenkampschroer ? I am not aware of native support for Vulkan on MacOS.
Author
Owner

@soerenkampschroer commented on GitHub (Jan 6, 2025):

@MarcelHeemskerk Yes that is using MoltenVK. Performance is great, but it seems like it's too buggy. The latest build of llama.cpp is not working for me anymore, as the output is now fully corrupted and it's failing a lot more of the backend tests. That's where I gave up.

<!-- gh-comment-id:2572959904 --> @soerenkampschroer commented on GitHub (Jan 6, 2025): @MarcelHeemskerk Yes that is using MoltenVK. Performance is great, but it seems like it's too buggy. The latest build of llama.cpp is not working for me anymore, as the output is now fully corrupted and it's failing a lot more of the backend tests. That's where I gave up.
Author
Owner

@FellowTraveler commented on GitHub (Jan 17, 2025):

@MarcelHeemskerk Yes that is using MoltenVK. Performance is great, but it seems like it's too buggy. The latest build of llama.cpp is not working for me anymore, as the output is now fully corrupted and it's failing a lot more of the backend tests. That's where I gave up.

Did you inform the Llama.cpp team? The bug might be there, rather than in MoltenVK.

<!-- gh-comment-id:2599130623 --> @FellowTraveler commented on GitHub (Jan 17, 2025): > [@MarcelHeemskerk](https://github.com/MarcelHeemskerk) Yes that is using MoltenVK. Performance is great, but it seems like it's too buggy. The latest build of llama.cpp is not working for me anymore, as the output is now fully corrupted and it's failing a lot more of the backend tests. That's where I gave up. Did you inform the Llama.cpp team? The bug might be there, rather than in MoltenVK.
Author
Owner

@soerenkampschroer commented on GitHub (Jan 17, 2025):

I did, you can find the issue here.

They assume it's a driver/moltenvk bug.

<!-- gh-comment-id:2599159284 --> @soerenkampschroer commented on GitHub (Jan 17, 2025): I did, you can find the issue [here](https://github.com/ggerganov/llama.cpp/issues/10984). They assume it's a driver/moltenvk bug.
Author
Owner

@FellowTraveler commented on GitHub (Jan 18, 2025):

See this: https://github.com/KhronosGroup/MoltenVK/issues/2423#issuecomment-2599817892

<!-- gh-comment-id:2599847414 --> @FellowTraveler commented on GitHub (Jan 18, 2025): See this: https://github.com/KhronosGroup/MoltenVK/issues/2423#issuecomment-2599817892
Author
Owner

@marekk1717 commented on GitHub (Jan 30, 2025):

I've got the same problem on a hackintosh on Sequoia with an AMD GPU, Vulkan and MoltenVK. It starts generating the answer super fast (compared to CPU) but then it prints some strange characters.

The quantized version of llama3 (Llama-3-8B/ggml-model-bf16-Q4_K_M.gguf, run with -n 128 -ngl 1 -i) outputs gibberish.

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

▅ permet In Mayor? Список?OFandidnice Inief▅

▅poISBN nature	
                D?
                  at
                    D?
                      at
                        D?
                          at
                            D?
<!-- gh-comment-id:2624165097 --> @marekk1717 commented on GitHub (Jan 30, 2025): I've got the same problem on hackintosh on Sequoia with AMD GPU, Vulkan and MoltenVK. It starts creating answer super fast (compared to CPU) but then it prints some strange characters. > > The quantized version of llama3: Llama-3-8B/ggml-model-bf16-Q4_K_M.gguf -n 128 -ngl 1 -i output gibberish. > > ``` > == Running in interactive mode. == > - Press Ctrl+C to interject at any time. > - Press Return to return control to the AI. > - To return control without starting a new line, end your input with '/'. > - If you want to submit another line, end your input with '\'. > > ▅ permet In Mayor? Список?OFandidnice Inief▅ > > ▅poISBN nature > D? > at > D? > at > D? > at > D? > ``` >
Author
Owner

@dboyan commented on GitHub (Feb 3, 2025):

For people trying to use the Vulkan backend via MoltenVK but getting garbled output, you may want to try out the change in MoltenVK/2434. This is a PoC fix addressing an issue within MoltenVK, and llama.cpp now works mostly as expected with it. We will try to push the change to mainline after finalizing the solution.

<!-- gh-comment-id:2632099300 --> @dboyan commented on GitHub (Feb 3, 2025): For people trying to use vulkan backend via MoltenVK but getting garbled output, you may want to try out the change in [MoltenVK/2434](https://github.com/KhronosGroup/MoltenVK/pull/2434). This is a PoC fix to address an issue within MoltenVK and llama.cpp now works mostly as expected with it. We will try to push the change to mainline after finalizing the solution.
Author
Owner

@marekk1717 commented on GitHub (Feb 3, 2025):

Thanks!

Can you share the steps for how to compile it and how to build llama.cpp to properly include Vulkan?

For people trying to use vulkan backend via MoltenVK but getting garbled output, you may want to try out the change in MoltenVK/2434. This is a PoC fix to address an issue within MoltenVK and llama.cpp now works mostly as expected with it. We will try to push the change to mainline after finalizing the solution.

<!-- gh-comment-id:2632160103 --> @marekk1717 commented on GitHub (Feb 3, 2025): Thanks ! Can you share steps how to compile it and how to build llama.cpp to properly include Vulkan? > For people trying to use vulkan backend via MoltenVK but getting garbled output, you may want to try out the change in [MoltenVK/2434](https://github.com/KhronosGroup/MoltenVK/pull/2434). This is a PoC fix to address an issue within MoltenVK and llama.cpp now works mostly as expected with it. We will try to push the change to mainline after finalizing the solution.
Author
Owner

@dboyan commented on GitHub (Feb 3, 2025):

You can follow the MoltenVK instruction here, especially the "Install MoltenVK to Replace the Vulkan SDK libMoltenVK.dylib" part. Just switch the codebase to the PR branch. If you installed MoltenVK from sources other than LunarG Vulkan SDK (e.g., from homebrew), you'll need to check the path where they actually install their libMoltenVK.dylib and replace the file there (according to https://github.com/KhronosGroup/MoltenVK/issues/2423#issuecomment-2636430442).

If you are simply building llama.cpp, you can just add -DGGML_VULKAN=ON to the cmake arguments as written here. It should work as long as you have installed the Vulkan SDK and replaced libMoltenVK.dylib with the custom version.

<!-- gh-comment-id:2632445664 --> @dboyan commented on GitHub (Feb 3, 2025): You can follow the MoltenVK instruction [here](https://github.com/KhronosGroup/MoltenVK?tab=readme-ov-file#building), especially the "Install MoltenVK to Replace the Vulkan SDK libMoltenVK.dylib" part. Just switch the codebase to the [PR branch](https://github.com/dboyan/MoltenVK/tree/specialization-macro). If you installed MoltenVK from sources other than LunarG Vulkan SDK (e.g., from homebrew), you'll need to check the path where they actually install their `libMoltenVK.dylib` and replace the file there (according to https://github.com/KhronosGroup/MoltenVK/issues/2423#issuecomment-2636430442). If simply building llama.cpp, you can just add `-DGGML_VULKAN=ON` to the cmake arguments as written [here](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md). It should work as long as you installed Vulkan SDK and replace the libMoltenVK.dylib with the custom version.
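As a quick sanity check that the Vulkan loader is actually picking up the rebuilt MoltenVK rather than the original dylib, something like the following can help (vulkaninfo ships with the Vulkan SDK; the grep pattern is just an example):

```
# Summary of the driver and device the loader selected
vulkaninfo --summary

# Ask the Vulkan loader to log which ICD/libMoltenVK.dylib it opens
VK_LOADER_DEBUG=driver vulkaninfo --summary 2>&1 | grep -i moltenvk
```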
Author
Owner

@marekk1717 commented on GitHub (Feb 4, 2025):

still doesn't work:
What is 1+1?imatelyA9H0>!&:;6<F'<13%E<3'B4G$11%(52,?167?E?<B2'0&4F=:8%FE&A9%9B-!.D:1>HH8,>C&(B-E,#/1=#.#3--7>>67;%":H8"$.C4<<086+=5;<(B-A14&;.?!.7/5.46-8%10)7&-0B91E

<!-- gh-comment-id:2633646826 --> @marekk1717 commented on GitHub (Feb 4, 2025): still doesn't work: What is 1+1?<think>imatelyA9H0>!&:;6<F'<13%E<3'B4G$11%(52,?16*7?E?<B2'0&4F=:8%FE&A*9%9B-!.D:*1>HH8,>C&(B-E,#/1=#*.#*3--7>>67;%":H8"$.C4<<086+=5;<(B-A14&;.?!.7/5.46-8%10)7&-0B91*E
Author
Owner

@soerenkampschroer commented on GitHub (Feb 4, 2025):

Similar results here. Depending on the model I'm able to get some correct output but then it's garbled again after a couple hundred tokens. When it works it's very fast for me though.

There are also failing tests in test_backend_ops. I can send some logs later today, but the mul_mat tests that I isolated in the other issue still fail.

For reference, I'm on an Intel Mac with an AMD RX 6800.

<!-- gh-comment-id:2633677300 --> @soerenkampschroer commented on GitHub (Feb 4, 2025): Similar results here. Depending on the model I'm able to get some correct output but then it's garbled again after a couple hundred tokens. When it works it's very fast for me though. There are also failing tests in test_backebd_ops. I can send some logs later today but the mul_mat tests that I isolated in the other issue still fail. For reference im I'm on an Intel mac with an AMD RX 6800.
Author
Owner

@dboyan commented on GitHub (Feb 4, 2025):

There are also failing tests in test_backend_ops. I can send some logs later today, but the mul_mat tests that I isolated in the other issue still fail.

For reference, I'm on an Intel Mac with an AMD RX 6800.

Interesting, on M1 all test_backend_ops cases are passing. Maybe there is still something not right on older GPUs. Unfortunately I don't have an Intel Mac with an AMD GPU, but if you are able, please tell us the information about the tests that are still failing, preferably on https://github.com/KhronosGroup/MoltenVK/issues/2423

Also, just curious: what are the models you use that give garbled results? For me on M1, a few models that I tried worked well most of the time, if not always. There are some models that give garbled output sometimes, and I'm not totally sure what's wrong there. Also, the Metal and Vulkan backends seem similarly fast on M1.

<!-- gh-comment-id:2634671375 --> @dboyan commented on GitHub (Feb 4, 2025): > There are also failing tests in test_backebd_ops. I can send some logs later today but the mul_mat tests that I isolated in the other issue still fail. > > For reference im I'm on an Intel mac with an AMD RX 6800. Interesting, on m1 all test_backend_ops cases are passing. Maybe there is still something not right on older gpus. Unfortunately I don't have an intel mac with amd gpu, but if you are able, please tell us the information about tests that are still failing, preferably on https://github.com/KhronosGroup/MoltenVK/issues/2423 Also, just curious. What are the models that you use which give garbled result? For me on m1, a few models that I tried worked well for most of the time, if not always. There are some model that gives garbled output sometimes, and I'm not totally sure what's wrong there. Also, metal and vulkan backends seems similarly fast on m1.
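For anyone else who wants to report the same information, the failing operations can be listed with llama.cpp's backend test binary. A sketch, assuming a Vulkan-enabled build in ./build (flag names can differ between llama.cpp versions):

```
# Run the correctness tests and keep the output for the bug report
./build/bin/test-backend-ops test 2>&1 | tee backend-ops.log
grep -i fail backend-ops.log

# Re-run only the matrix-multiply cases isolated earlier in the MoltenVK issue
./build/bin/test-backend-ops test -o MUL_MAT
```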
Author
Owner

@soerenkampschroer commented on GitHub (Feb 4, 2025):

Yes, Apple Silicon has much better support than Intel. While the Metal backend works on Intel/AMD, it is about half as fast as just running on the CPU. It does run flawlessly though, with no corruption at all. The reason people tried Vulkan/MoltenVK is that the speeds are great, but there is the issue of corrupted output. Maybe it would also be possible to speed up the Metal backend, but I understand that soon-to-be-deprecated hardware is not a top priority.

I've been using gemma-2-2b-it.Q8_0.gguf for testing, and it works for a while, but then it's corrupting. Same with Phi-3.1-mini-128k-instruct-Q8_0.
qwen2.5-coder-7b-instruct-q5_k_m corrupts after the first token and just repeats "@@@@".

I'll compile a list of the failing tests and post them on the other issue. I'm failing 383 tests as of the latest builds.

<!-- gh-comment-id:2635145508 --> @soerenkampschroer commented on GitHub (Feb 4, 2025): Yes, Apple Silicon has much better support than Intel. While the Metal backend works on Intel/AMD, it is about half as fast as just running on the CPU. It does run flawlessly though, no corruption at all. The reason why people tried Vulkan/MoltenVK is that the speeds are great, but there is the issue of corrupted output. Maybe it could also be possible to speed up the metal backend, but I understand that soon to be deprecated hardware is not a top priority. I've been using gemma-2-2b-it.Q8_0.gguf for testing, and it works for a while, but then it's corrupting. Same with Phi-3.1-mini-128k-instruct-Q8_0. qwen2.5-coder-7b-instruct-q5_k_m corrupts after the first token and just repeats "@@@@". I'll compile a list of the failing tests and post them on the other issue. I'm failing 383 tests as of the latest builds.
Author
Owner

@dboyan commented on GitHub (Feb 4, 2025):

I've been using gemma-2-2b-it.Q8_0.gguf for testing, and it works for a while, but then it's corrupting. Same with Phi-3.1-mini-128k-instruct-Q8_0. qwen2.5-coder-7b-instruct-q5_k_m corrupts after the first token and just repeats "@@@@".

I do see the same corruption patterns sometimes on a few models, even on M1 with Vulkan. But other models worked flawlessly for very long interactions. I'll try the models you mentioned locally when I have time.

I'll compile a list of the failing tests and post them on the other issue. I'm failing 383 tests as of the latest builds.

Thanks a lot! Just to make sure, you are using my branch instead of simply the latest main branch, right?

If possible, please isolate one failed test case and capture a gputrace as we have done earlier. Although I cannot replay the trace as-is, we can try to inspect it for a different code path.

<!-- gh-comment-id:2635172415 --> @dboyan commented on GitHub (Feb 4, 2025): > I've been using gemma-2-2b-it.Q8_0.gguf for testing, and it works for a while, but then it's corrupting. Same with Phi-3.1-mini-128k-instruct-Q8_0. qwen2.5-coder-7b-instruct-q5_k_m corrupts after the first token and just repeats "@@@@". I do see the same corruption patterns sometimes on a few models even on m1 with vulkan. But other models worked flawlessly for very long interactions. I'll try the model you mentioned locally when I have time. > I'll compile a list of the failing tests and post them on the other issue. I'm failing 383 tests as of the latest builds. Thanks a lot! Just to make sure, you are using my branch instead of simply the latest main branch right? If possible, please isolate one failed test case and capture a gputrace as we have done earlier. Although I cannot replay the trace as-is, we can try to inspect the trace for different code path.
Author
Owner

@soerenkampschroer commented on GitHub (Feb 4, 2025):

No problem at all, happy to help where I can! I've been using your branch of MoltenVK and then the latest main branch from llama.cpp. I've also made sure that /usr/local/lib/libMoltenVK.dylib is installed correctly after building your PR. Then I compiled llama.cpp like this:

cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 -DVulkan_INCLUDE_DIR=/usr/local/Cellar/vulkan-headers/1.4.307/include -DVulkan_LIBRARY=/usr/local/lib/libvulkan.1.4.307.dylib

I see that you mentioned only using -DGGML_VULKAN=ON, does that make a difference?

<!-- gh-comment-id:2635196383 --> @soerenkampschroer commented on GitHub (Feb 4, 2025): No problem at all, happy to help where I can! I've been using your branch of MoltenVK and then the latest main branch from llama.cpp. I've also made sure that `/usr/local/lib/libMoltenVK.dylib` is installed correctly after building your pr. Then I compiled llama.cpp like this: `cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 -DVulkan_INCLUDE_DIR=/usr/local/Cellar/vulkan-headers/1.4.307/include -DVulkan_LIBRARY=/usr/local/lib/libvulkan.1.4.307.dylib` I see that you mentioned only using `-DGGML_VULKAN=ON`, does that make a difference?
Author
Owner

@dboyan commented on GitHub (Feb 4, 2025):

I see that you mentioned only using -DGGML_VULKAN=ON, does that make a difference?

I don't think so. It will build with both vulkan and metal backends.

<!-- gh-comment-id:2635200983 --> @dboyan commented on GitHub (Feb 4, 2025): > I see that you mentioned only using `-DGGML_VULKAN=ON`, does that make a difference? I don't think so. It will build with both vulkan and metal backends.
Author
Owner

@dboyan commented on GitHub (Feb 5, 2025):

@soerenkampschroer I suspect that your program is still using the original library without my change (or something is going wrong with my heuristic for setting the macro, though I can hardly imagine how). With the gputrace I captured on my own, I can see a few macros have been defined in the Metal library, just like this:

[Screenshot: the compute function's compile options in the gputrace, showing the expected preprocessor macros defined for the Metal library]

But the macro definition is blank within your trace above. (To find it, first find call 203 in your trace by expanding the "vkQueueSubmit" under call 189, and the "vkCmdDispatch" inside. And then double click "Compute Pipeline 0x7f8..." on the right. Expand "Compute Function" > "Compile Option" inside)

<!-- gh-comment-id:2635727189 --> @dboyan commented on GitHub (Feb 5, 2025): @soerenkampschroer I suspect that your program are still using the original library without my change (or something is going wrong with my heuristic of telling the macro, which I can hardly imagine how). With the gputrace I captured on my own, I can see a few macros have been defined in the metal library, just like this: <img width="758" alt="Image" src="https://github.com/user-attachments/assets/d7d3da62-cbf6-45be-a5c8-6f1e654370e6" /> But the macro definition is blank within your trace above. (To find it, first find call 203 in your trace by expanding the "vkQueueSubmit" under call 189, and the "vkCmdDispatch" inside. And then double click "Compute Pipeline 0x7f8..." on the right. Expand "Compute Function" > "Compile Option" inside)
Author
Owner

@soerenkampschroer commented on GitHub (Feb 7, 2025):

I just wanted to add to this issue that the fix to MoltenVK by @dboyan works great on my machine. I'm now able to use llama.cpp with GPU acceleration.

However, I wasn't able to compile ollama with vulkan support on macOS. There is a pull request to add vulkan support for linux, but I couldn't figure out how to make that work either. It's possible but needs some work.

<!-- gh-comment-id:2642611366 --> @soerenkampschroer commented on GitHub (Feb 7, 2025): I just wanted to add to this issue that the fix to MoltenVK by @dboyan works great on my machine. I'm now able to use llama.cpp with GPU acceleration. However, I wasn't able to compile ollama with vulkan support on macOS. There is a pull request to add vulkan support for linux, but I couldn't figure out how to make that work either. It's possible but needs some work.
Author
Owner

@marekk1717 commented on GitHub (Feb 7, 2025):

Would it be possible for you to describe step by step how you built everything?

I just wanted to add to this issue that the fix to MoltenVK by @dboyan works great on my machine. I'm now able to use llama.cpp with GPU acceleration.

<!-- gh-comment-id:2642614913 --> @marekk1717 commented on GitHub (Feb 7, 2025): Would it be possible for you step by step how you build everything? > I just wanted to add to this issue that the fix to MoltenVK by [@dboyan](https://github.com/dboyan) works great on my machine. I'm now able to use llama.cpp with GPU acceleration.
Author
Owner

@soerenkampschroer commented on GitHub (Feb 7, 2025):

This is retracing my steps from memory, but it should at least get you on the right track.

  1. Install dependencies:
brew install libomp vulkan-headers glslang molten-vk shaderc vulkan-loader
  2. Clone MoltenVK and pull the PR
git clone git@github.com:KhronosGroup/MoltenVK.git
cd MoltenVK
git fetch origin pull/2434/head:p2434
git switch p2434
  3. Build MoltenVK
./fetchDependencies --macos
make macos
  4. Install
    Note: The path will be different depending on the version of molten-vk you installed.
    Copy ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib to /usr/local/Cellar/molten-vk/1.2.11/lib/.

  5. Build llama.cpp
    Clone the repo as normal and build it with:

cmake -B build -DGGML_METAL=OFF -DGGML_VULKAN=ON
cmake --build build --config Release
<!-- gh-comment-id:2642713162 --> @soerenkampschroer commented on GitHub (Feb 7, 2025): This is retracing my steps from memory, but it should at least get you on the right track. 1. Install dependencies: ``` brew install libomp vulkan-headers glslang molten-vk shaderc vulkan-loader ``` 2. Clone MoltenVK and pull the PR ``` git clone git@github.com:KhronosGroup/MoltenVK.git cd MoltenVK git fetch origin pull/2434/head:p2434 git switch p2434 ``` 3. Build MoltenVK ``` ./fetchDependencies --macos make macos ``` 4. Install _Note: The path will be different depending on the version of molten-vk you installed._ Copy `./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib` to `/usr/local/Cellar/molten-vk/1.2.11/lib/`. 5. Build llama.cpp Clone the repo as normal and build it with: ``` cmake -B build -DGGML_METAL=OFF -DGGML_VULKAN=ON cmake --build build --config Release ```
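A quick way to verify that the resulting build actually uses the GPU is to run a short prompt and check the startup log; the model path below is a placeholder, and the exact log lines vary between llama.cpp versions, but the Vulkan backend reports the detected device and the number of offloaded layers:

```
./build/bin/llama-cli -m ~/models/qwen2.5-coder-7b-instruct-q5_k_m.gguf -ngl 99 -p "Hello" -n 32
# In the startup output, look for the Vulkan device line naming the AMD GPU
# and for something like "offloaded 33/33 layers to GPU" (the count depends on the model).
```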
Author
Owner

@marekk1717 commented on GitHub (Feb 7, 2025):

Thank you. Please can you also share a llama command line, including the model name and all parameters?

<!-- gh-comment-id:2642733539 --> @marekk1717 commented on GitHub (Feb 7, 2025): Thank you. PLease can you share also a llama command line incl model name and all parameters?
Author
Owner

@soerenkampschroer commented on GitHub (Feb 7, 2025):

All the models I've tested so far worked, but as an example:

./llama-cli -m ~/models/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/qwen2.5-coder-7b-instruct-q5_k_m.gguf --n-gpu-layers 100
<!-- gh-comment-id:2642751308 --> @soerenkampschroer commented on GitHub (Feb 7, 2025): All the models I've tested so far worked, but as an example: ``` ./llama-cli -m ~/models/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/qwen2.5-coder-7b-instruct-q5_k_m.gguf --n-gpu-layers 100 ```
Author
Owner

@marekk1717 commented on GitHub (Feb 7, 2025):

Thank you !!!
Now it works and it's crazy fast :)

<!-- gh-comment-id:2642859207 --> @marekk1717 commented on GitHub (Feb 7, 2025): Thank you !!! Now it works and it's crazy fast :)
Author
Owner

@marekk1717 commented on GitHub (Feb 7, 2025):

Next question: any recommendation for a good Web UI that can be connected to llama-server? ;)

<!-- gh-comment-id:2643002896 --> @marekk1717 commented on GitHub (Feb 7, 2025): Next question, any recommendation for good Web UI that can be connected to llama-server? ;)
Author
Owner

@THL-Leo commented on GitHub (Feb 7, 2025):

All the models I've tested so far worked, but as an example:

./llama-cli -m ~/models/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/qwen2.5-coder-7b-instruct-q5_k_m.gguf --n-gpu-layers 100

Do you know if this will work with Ollama as well, or just llama.cpp?

I was able to get it working on llama.cpp thanks to you :)

<!-- gh-comment-id:2644068246 --> @THL-Leo commented on GitHub (Feb 7, 2025): > All the models I've tested so far worked, but as an example: > > ``` > ./llama-cli -m ~/models/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/qwen2.5-coder-7b-instruct-q5_k_m.gguf --n-gpu-layers 100 > ``` Do you know if this will work with Ollama as well? Or just Llama.cpp. I was able to get it working on llama.cpp thanks to you :)
Author
Owner

@soerenkampschroer commented on GitHub (Feb 7, 2025):

Do you know if this will work with Ollama as well? Or just Llama.cpp

Ollama is built on top of llama.cpp, so yes, it would work. The problem is that ollama does not use Vulkan in its macOS or Linux versions, and they don't intend to at the moment. See here.

I'm not familiar enough with the project to say how much work it would be to add vulkan support, but it doesn't look as easy as with other projects.

As for other GUIs:

I was able to build jan.ai with Vulkan support, but it uses a custom layer on top of llama.cpp (cortex), and for some reason I got about half the speed compared to plain llama.cpp. It should be possible to improve, but that's as deep as I was willing to dig.

Another way would be to use llama-server and then a web GUI like open-webui through the OpenAI-compatible API. But there is no model management and it's a bit cumbersome.

Then there is LocalAI which looks promising and should be relatively easy to build with llama.cpp+vulkan. That's next on my list.

As long as the fix is not permanent and merged into MoltenVK I'm hesitant to open issues and ask for vulkan support.

<!-- gh-comment-id:2644270168 --> @soerenkampschroer commented on GitHub (Feb 7, 2025): > Do you know if this will work with Ollama as well? Or just Llama.cpp Ollama is built on top of llama.cpp, so yes it would work. The problem is, ollama does not use vulkan in its macOS or linux version, and they don't intend to at the moment. [See here](https://github.com/ollama/ollama/pull/5059#issuecomment-2628002106). I'm not familiar enough with the project to say how much work it would be to add vulkan support, but it doesn't look as easy as with other projects. As for other GUIs: I was able to build jan.ai with vulkan support, but it uses a custom layer on top of llama (cortex) and for some reason I got like half the speed as with just llama.cpp. Should be possible to be improved, but that's how deep I was willing to dig. Another way would be to use llama-server and then a webgui like open-webui through the openai compatible API. But there is no model management and it's a bit cumbersome. Then there is LocalAI which looks promising and should be relatively easy to build with llama.cpp+vulkan. That's next on my list. As long as the fix is not permanent and merged into MoltenVK I'm hesitant to open issues and ask for vulkan support.
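For the llama-server route mentioned above, a minimal sketch looks like this (model path and port are placeholders); any OpenAI-compatible client, including open-webui, can then be pointed at the /v1 endpoint:

```
# Start the server with full GPU offload
./build/bin/llama-server -m ~/models/gemma-2-9b-it.Q5_K_M.gguf -ngl 99 --host 127.0.0.1 --port 8080

# Query the OpenAI-compatible chat endpoint from another shell
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello from the Vulkan backend"}]}'
```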
Author
Owner

@TomDev234 commented on GitHub (Feb 8, 2025):

Is there a macOS GUI available for llama.cpp? I couldn't find any.

<!-- gh-comment-id:2644356506 --> @TomDev234 commented on GitHub (Feb 8, 2025): Is there a macOS gui available for llama.cpp? I couldn't find any.
Author
Owner

@THL-Leo commented on GitHub (Feb 8, 2025):

Is there a macOS GUI available for llama.cpp? I couldn't find any.

I think it would be faster to just build a simple one using UI libraries and host it on localhost. That's what I plan to do right now, at least.

<!-- gh-comment-id:2645967557 --> @THL-Leo commented on GitHub (Feb 8, 2025): > Is there a macOS gui available for llama.cpp? I couldn't find any. Think it would be faster to just build a simple one using UI libraries and host it on localhost. That's what I plan on to do right now at least.
Author
Owner

@prabhu commented on GitHub (Feb 22, 2025):

https://github.com/KhronosGroup/MoltenVK/pull/2441 got merged.

Step 2 from this comment becomes:

git clone git@github.com:KhronosGroup/MoltenVK.git
cd MoltenVK
git fetch origin pull/2441/head:p2441
git switch p2441
<!-- gh-comment-id:2676156408 --> @prabhu commented on GitHub (Feb 22, 2025): https://github.com/KhronosGroup/MoltenVK/pull/2441 got merged. Step 2 from this [comment](https://github.com/ollama/ollama/issues/1016#issuecomment-2642713162) becomes: ``` git clone git@github.com:KhronosGroup/MoltenVK.git cd MoltenVK git fetch origin pull/2441/head:p2441 git switch p2441 ```
Author
Owner

@soerenkampschroer commented on GitHub (Feb 22, 2025):

@prabhu Did you test it on Intel/AMD?

The merged fix does not work for me.

<!-- gh-comment-id:2676163256 --> @soerenkampschroer commented on GitHub (Feb 22, 2025): @prabhu Did you test it on Intel/AMD? The merged fix does not work for me.
Author
Owner

@prabhu commented on GitHub (Feb 22, 2025):

Any errors? I don't see any difference in performance between the two branches.

<!-- gh-comment-id:2676171204 --> @prabhu commented on GitHub (Feb 22, 2025): Any errors? I don't see any difference in performance between the two branches.
Author
Owner

@soerenkampschroer commented on GitHub (Feb 22, 2025):

Interesting. On my machine, the output is corrupted like before.

Could you tell me what version of macOS and what GPU you're running?

I've been trying to find a fix in the issue over at MoltenVK.

<!-- gh-comment-id:2676185321 --> @soerenkampschroer commented on GitHub (Feb 22, 2025): Interesting. On my machine, the output is corrupted like before. Could you tell me what version of macOS and what GPU you're running? I've been trying to find a fix in the [issue over at MoltenVK](https://github.com/KhronosGroup/MoltenVK/issues/2423#issuecomment-2668812505).
Author
Owner

@prabhu commented on GitHub (Feb 22, 2025):

system_profiler SPSoftwareDataType SPHardwareDataType
Software:

    System Software Overview:

      System Version: macOS 15.3.1 (24D70)
      Kernel Version: Darwin 24.3.0
      Boot Volume: Disk
      Boot Mode: Normal
      Computer Name: MacBook Pro
      User Name: ***
      Secure Virtual Memory: Enabled
      System Integrity Protection: Enabled
      

Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro16,1
      Processor Name: 8-Core Intel Core i9
      Processor Speed: 2.4 GHz
      Number of Processors: 1
      Total Number of Cores: 8
      L2 Cache (per Core): 256 KB
      L3 Cache: 16 MB
      Hyper-Threading Technology: Enabled
      Memory: 32 GB
      System Firmware Version: 2069.80.3.0.0 (iBridge: 22.16.13051.0.0,0)
      OS Loader Version: 582~3311
      
system_profiler SPSoftwareDataType SPDisplaysDataType
Software:

    System Software Overview:

      System Version: macOS 15.3.1 (24D70)
      Kernel Version: Darwin 24.3.0
      Boot Volume: Disk
      Boot Mode: Normal
      Computer Name: MacBook Pro
      Secure Virtual Memory: Enabled
      System Integrity Protection: Enabled
      

Graphics/Displays:

    Intel UHD Graphics 630:

      Chipset Model: Intel UHD Graphics 630
      Type: GPU
      Bus: Built-In
      VRAM (Dynamic, Max): 1536 MB
      Vendor: Intel
      Device ID: 0x3e9b
      Revision ID: 0x0002
      Automatic Graphics Switching: Supported
      gMux Version: 5.0.0
      Metal Support: Metal 3

    AMD Radeon Pro 5500M:

      Chipset Model: AMD Radeon Pro 5500M
      Type: GPU
      Bus: PCIe
      PCIe Lane Width: x16
      VRAM (Total): 8 GB
      Vendor: AMD (0x1002)
      Device ID: 0x7340
      Revision ID: 0x0040
      ROM Revision: 113-D3220E-190
      VBIOS Version: 113-D32206U1-019
      Option ROM Version: 113-D32206U1-019
      EFI Driver Version: 01.A1.190
      Automatic Graphics Switching: Supported
      gMux Version: 5.0.0
      Metal Support: Metal 3
      Displays:
        Odyssey G93SC:
          Resolution: 5120 x 1440
          UI Looks like: 5120 x 1440 @ 120.00Hz
          Framebuffer Depth: 30-Bit Color (ARGB2101010)          
          Main Display: Yes
          Mirror: Off
          Online: Yes
          Rotation: Supported
          Connection Type: Thunderbolt/DisplayPort
Author
Owner

@soerenkampschroer commented on GitHub (Feb 22, 2025):

Thanks for the info!

So it appears that the fix works for 5000 but not 6000 series GPUs. Or there is something wrong with my setup.

Does anyone have a 6000 series GPU and is willing to test the latest master branch of moltenvk?

Author
Owner

@jeffklassen commented on GitHub (Feb 24, 2025):

this fix does not work for me:

> cmake -B build -DGGML_METAL=OFF -DGGML_VULKAN=ON -DGGML_CCACHE=OFF
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu-sandybridge: -msse4.2;-mavx GGML_SSE42;GGML_AVX
-- x86 detected
-- Adding CPU backend variant ggml-cpu-haswell: -msse4.2;-mf16c;-mfma;-mavx;-mavx2 GGML_SSE42;GGML_F16C;GGML_FMA;GGML_AVX;GGML_AVX2
-- x86 detected
-- Adding CPU backend variant ggml-cpu-skylakex: -msse4.2;-mf16c;-mfma;-mavx;-mavx2;-mavx512f;-mavx512cd;-mavx512vl;-mavx512dq;-mavx512bw GGML_SSE42;GGML_F16C;GGML_FMA;GGML_AVX;GGML_AVX2;GGML_AVX512
-- x86 detected
-- Adding CPU backend variant ggml-cpu-icelake: -msse4.2;-mf16c;-mfma;-mavx;-mavx2;-mavx512f;-mavx512cd;-mavx512vl;-mavx512dq;-mavx512bw;-mavx512vbmi;-mavx512vnni GGML_SSE42;GGML_F16C;GGML_FMA;GGML_AVX;GGML_AVX2;GGML_AVX512;GGML_AVX512_VBMI;GGML_AVX512_VNNI
-- x86 detected
-- Adding CPU backend variant ggml-cpu-alderlake: -msse4.2;-mf16c;-mfma;-mavx;-mavx2;-mavxvnni GGML_SSE42;GGML_F16C;GGML_FMA;GGML_AVX;GGML_AVX2;GGML_AVX_VNNI
CMake Error at ml/backend/ggml/ggml/src/CMakeLists.txt:257 (add_subdirectory):
  add_subdirectory given source "ggml-vulkan" which is not an existing
  directory.
Call Stack (most recent call first):
  ml/backend/ggml/ggml/src/CMakeLists.txt:309 (ggml_add_backend)


-- Including Vulkan backend
Author
Owner

@tristan-k commented on GitHub (Feb 24, 2025):

Thanks for the info!

So it appears that the fix works for 5000 but not 6000 series GPUs. Or there is something wrong with my setup.

Does anyone have a 6000 series GPU and is willing to test the latest master branch of moltenvk?

I can test with a 6000 series GPU, but I'm kinda confused about what steps need to be done in order to get there.

Author
Owner

@dboyan commented on GitHub (Feb 24, 2025):

It's a known issue that the RX 6000 series does not work with the latest main branch, but that is due to another, independent issue (https://github.com/KhronosGroup/MoltenVK/issues/2458). There is a workaround, though. To make it work for now, follow the steps in https://github.com/ollama/ollama/issues/1016#issuecomment-2642713162, but replace step 2 with:

git clone git@github.com:KhronosGroup/MoltenVK.git
cd MoltenVK
git revert 835f85ec
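
The remaining steps are the same as in the linked comment: build MoltenVK and drop the resulting dylib over the Homebrew-installed one. Roughly (the Cellar path below is an example and depends on which molten-vk version brew installed):

```
./fetchDependencies --macos
make macos
# overwrite the brew copy so the Vulkan loader picks up the patched build
cp ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib /usr/local/Cellar/molten-vk/1.2.11/lib/
```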
Author
Owner

@rchesnut-amgteam commented on GitHub (Feb 25, 2025):

This is retracing my steps from memory, but it should at least get you on the right track.

  1. Install dependencies:
brew install libomp vulkan-headers glslang molten-vk shaderc vulkan-loader
  2. Clone MoltenVK and pull the PR
git clone git@github.com:KhronosGroup/MoltenVK.git
cd MoltenVK
git fetch origin pull/2434/head:p2434
git switch p2434
  3. Build MoltenVK
./fetchDependencies --macos
make macos
  4. Install
    Note: The path will be different depending on the version of molten-vk you installed.
    Copy ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib to /usr/local/Cellar/molten-vk/1.2.11/lib/.
  5. Build llama.cpp
    Clone the repo as normal and build it with:
cmake -B build -DGGML_METAL=OFF -DGGML_VULKAN=ON
cmake --build build --config Release

I followed these steps and get a kernel panic on build every time.

My specs:
Intel i9, 3.18 GHz
RX 6800 XT

Author
Owner

@kanadgodse commented on GitHub (Feb 26, 2025):

this fix does not work for me:

> cmake -B build -DGGML_METAL=OFF -DGGML_VULKAN=ON -DGGML_CCACHE=OFF
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu-sandybridge: -msse4.2;-mavx GGML_SSE42;GGML_AVX
-- x86 detected
-- Adding CPU backend variant ggml-cpu-haswell: -msse4.2;-mf16c;-mfma;-mavx;-mavx2 GGML_SSE42;GGML_F16C;GGML_FMA;GGML_AVX;GGML_AVX2
-- x86 detected
-- Adding CPU backend variant ggml-cpu-skylakex: -msse4.2;-mf16c;-mfma;-mavx;-mavx2;-mavx512f;-mavx512cd;-mavx512vl;-mavx512dq;-mavx512bw GGML_SSE42;GGML_F16C;GGML_FMA;GGML_AVX;GGML_AVX2;GGML_AVX512
-- x86 detected
-- Adding CPU backend variant ggml-cpu-icelake: -msse4.2;-mf16c;-mfma;-mavx;-mavx2;-mavx512f;-mavx512cd;-mavx512vl;-mavx512dq;-mavx512bw;-mavx512vbmi;-mavx512vnni GGML_SSE42;GGML_F16C;GGML_FMA;GGML_AVX;GGML_AVX2;GGML_AVX512;GGML_AVX512_VBMI;GGML_AVX512_VNNI
-- x86 detected
-- Adding CPU backend variant ggml-cpu-alderlake: -msse4.2;-mf16c;-mfma;-mavx;-mavx2;-mavxvnni GGML_SSE42;GGML_F16C;GGML_FMA;GGML_AVX;GGML_AVX2;GGML_AVX_VNNI
CMake Error at ml/backend/ggml/ggml/src/CMakeLists.txt:257 (add_subdirectory):
  add_subdirectory given source "ggml-vulkan" which is not an existing
  directory.
Call Stack (most recent call first):
  ml/backend/ggml/ggml/src/CMakeLists.txt:309 (ggml_add_backend)


-- Including Vulkan backend

~~Even I am facing the same issue.
When I cloned ollama, I found that the directory ml/backend/ggml/ggml/src does not contain a ggml-vulkan directory.

How do I add that?~~

I am an idiot!

I needed to clone https://github.com/ggml-org/llama.cpp and build that. Not Ollama!

@jeffklassen Try what I did above and then it should work.
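
In other words, the whole Vulkan build happens inside a llama.cpp checkout, not the ollama tree; roughly:

```
# clone llama.cpp (not ollama) and configure the Vulkan backend from its root
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=OFF -DGGML_VULKAN=ON -DGGML_CCACHE=OFF
cmake --build build --config Release
```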

Author
Owner

@soerenkampschroer commented on GitHub (Feb 26, 2025):

@kanadgodse did you test if ollama is using your GPU?

It will compile that way, but it will use the CPU instead of the Vulkan backend, I'm pretty sure.
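
A quick way to check is the startup log: with the Vulkan backend actually in use, you should see the device being picked up and layers offloaded. A minimal sketch (model path is a placeholder):

```
./build/bin/llama-cli -m ~/models/your-model.gguf --n-gpu-layers 25 -p "hello" -n 16 -no-cnv
# in the log, look for lines like:
#   ggml_vulkan: Found 1 Vulkan devices: ...
#   load_tensors: offloaded 25/29 layers to GPU
```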

Author
Owner

@kanadgodse commented on GitHub (Feb 27, 2025):

@kanadgodse did you test if ollama is using your GPU?

It will compile that way, but it will use the CPU instead of the Vulkan backend, I'm pretty sure.

Not yet, I am going to download a model and check. Will keep you posted.

Author
Owner

@kanadgodse commented on GitHub (Feb 27, 2025):

@kanadgodse did you test if ollama is using your GPU?

It will compile that way, but it will use the CPU instead of the Vulkan backend, I'm pretty sure.

I got it to run using Vulkan! But the output is not coming out properly.

Here's a screenshot that confirms that it's using the GPU when generating the output:

![Image](https://github.com/user-attachments/assets/0f99e5d7-d336-42aa-b4d5-5a038c95c50b)

Here's what I ran and the output:

(pvenv) ➜  llama.cpp git:(master) ./build/bin/llama-cli -m ~/models/DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf --n-gpu-layers 25
[mvk-info] MoltenVK version 1.2.12, supporting Vulkan version 1.2.296.
	The following 109 Vulkan extensions are supported:
	VK_KHR_16bit_storage v1
	VK_KHR_8bit_storage v1
	VK_KHR_bind_memory2 v1
	VK_KHR_calibrated_timestamps v1
	VK_KHR_copy_commands2 v1
	VK_KHR_create_renderpass2 v1
	VK_KHR_dedicated_allocation v3
	VK_KHR_deferred_host_operations v4
	VK_KHR_depth_stencil_resolve v1
	VK_KHR_descriptor_update_template v1
	VK_KHR_device_group v4
	VK_KHR_device_group_creation v1
	VK_KHR_driver_properties v1
	VK_KHR_dynamic_rendering v1
	VK_KHR_external_fence v1
	VK_KHR_external_fence_capabilities v1
	VK_KHR_external_memory v1
	VK_KHR_external_memory_capabilities v1
	VK_KHR_external_semaphore v1
	VK_KHR_external_semaphore_capabilities v1
	VK_KHR_fragment_shader_barycentric v1
	VK_KHR_format_feature_flags2 v2
	VK_KHR_get_memory_requirements2 v1
	VK_KHR_get_physical_device_properties2 v2
	VK_KHR_get_surface_capabilities2 v1
	VK_KHR_imageless_framebuffer v1
	VK_KHR_image_format_list v1
	VK_KHR_incremental_present v2
	VK_KHR_maintenance1 v2
	VK_KHR_maintenance2 v1
	VK_KHR_maintenance3 v1
	VK_KHR_map_memory2 v1
	VK_KHR_multiview v1
	VK_KHR_portability_subset v1
	VK_KHR_push_descriptor v2
	VK_KHR_relaxed_block_layout v1
	VK_KHR_sampler_mirror_clamp_to_edge v3
	VK_KHR_sampler_ycbcr_conversion v14
	VK_KHR_separate_depth_stencil_layouts v1
	VK_KHR_shader_draw_parameters v1
	VK_KHR_shader_float_controls v4
	VK_KHR_shader_float16_int8 v1
	VK_KHR_shader_integer_dot_product v1
	VK_KHR_shader_non_semantic_info v1
	VK_KHR_shader_subgroup_extended_types v1
	VK_KHR_shader_terminate_invocation v1
	VK_KHR_spirv_1_4 v1
	VK_KHR_storage_buffer_storage_class v1
	VK_KHR_surface v25
	VK_KHR_swapchain v70
	VK_KHR_swapchain_mutable_format v1
	VK_KHR_synchronization2 v1
	VK_KHR_timeline_semaphore v2
	VK_KHR_uniform_buffer_standard_layout v1
	VK_KHR_variable_pointers v1
	VK_KHR_vertex_attribute_divisor v1
	VK_EXT_4444_formats v1
	VK_EXT_calibrated_timestamps v2
	VK_EXT_debug_marker v4
	VK_EXT_debug_report v10
	VK_EXT_debug_utils v2
	VK_EXT_descriptor_indexing v2
	VK_EXT_depth_clip_control v1
	VK_EXT_extended_dynamic_state v1
	VK_EXT_extended_dynamic_state2 v1
	VK_EXT_extended_dynamic_state3 v2
	VK_EXT_external_memory_host v1
	VK_EXT_fragment_shader_interlock v1
	VK_EXT_hdr_metadata v3
	VK_EXT_headless_surface v1
	VK_EXT_host_image_copy v1
	VK_EXT_host_query_reset v1
	VK_EXT_image_robustness v1
	VK_EXT_inline_uniform_block v1
	VK_EXT_layer_settings v2
	VK_EXT_memory_budget v1
	VK_EXT_metal_objects v2
	VK_EXT_metal_surface v1
	VK_EXT_pipeline_creation_cache_control v3
	VK_EXT_pipeline_creation_feedback v1
	VK_EXT_post_depth_coverage v1
	VK_EXT_private_data v1
	VK_EXT_robustness2 v1
	VK_EXT_sample_locations v1
	VK_EXT_scalar_block_layout v1
	VK_EXT_separate_stencil_usage v1
	VK_EXT_shader_demote_to_helper_invocation v1
	VK_EXT_shader_stencil_export v1
	VK_EXT_shader_subgroup_ballot v1
	VK_EXT_shader_subgroup_vote v1
	VK_EXT_shader_viewport_index_layer v1
	VK_EXT_subgroup_size_control v2
	VK_EXT_surface_maintenance1 v1
	VK_EXT_swapchain_colorspace v5
	VK_EXT_swapchain_maintenance1 v1
	VK_EXT_texel_buffer_alignment v1
	VK_EXT_texture_compression_astc_hdr v1
	VK_EXT_tooling_info v1
	VK_EXT_vertex_attribute_divisor v3
	VK_AMD_gpu_shader_half_float v2
	VK_AMD_negative_viewport_height v1
	VK_AMD_shader_image_load_store_lod v1
	VK_AMD_shader_trinary_minmax v1
	VK_IMG_format_pvrtc v1
	VK_INTEL_shader_integer_functions2 v1
	VK_GOOGLE_display_timing v1
	VK_MVK_macos_surface v3
	VK_MVK_moltenvk v37
	VK_NV_fragment_shader_barycentric v1
[mvk-info] GPU device:
	model: AMD Radeon Polaris
	type: Discrete
	vendorID: 0x1002
	deviceID: 0x67ff
	pipelineCacheUUID: BFB76A60-0C07-0200-0000-000100000000
	GPU memory available: 4096 MB
	GPU memory used: 0 MB
	Metal Shading Language 2.4
	supports the following GPU Features:
		GPU Family Mac 2
		Read-Write Texture Tier 2
[mvk-info] Created VkInstance for Vulkan version 1.2.296, as requested by app, with the following 0 Vulkan extensions enabled:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Polaris (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
build: 4783 (a800ae46) with Apple clang version 14.0.0 (clang-1400.0.29.202) for x86_64-apple-darwin21.6.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Polaris) - 4096 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /Users/kanadg/models/DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 1.5B
llama_model_loader: - kv   5:                            general.license str              = mit
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                          general.file_type u32              = 1
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type  f16:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 3.31 GiB (16.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 1536
print_info: n_layer          = 28
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 8960
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1.5B
print_info: model params     = 1.78 B
print_info: general.name     = DeepSeek R1 Distill Qwen 1.5B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[mvk-info] Vulkan semaphores using MTLEvent.
[mvk-info] Descriptor sets binding resources using Metal argument buffers.
[mvk-info] Created VkDevice to run on GPU AMD Radeon Polaris with the following 3 Vulkan extensions enabled:
	VK_KHR_16bit_storage v1
	VK_KHR_shader_float16_int8 v1
	VK_EXT_subgroup_size_control v2
load_tensors: offloading 25 repeating layers to GPU
load_tensors: offloaded 25/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  3389.80 MiB
load_tensors:      Vulkan0 model buffer size =  2231.74 MiB
............................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =    12.00 MiB
llama_kv_cache_init:    Vulkan0 KV buffer size =   100.00 MiB
llama_init_from_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.58 MiB
llama_init_from_model:    Vulkan0 compute buffer size =   116.01 MiB
llama_init_from_model:        CPU compute buffer size =   299.75 MiB
llama_init_from_model: Vulkan_Host compute buffer size =   108.01 MiB
llama_init_from_model: graph nodes  = 986
llama_init_from_model: graph splits = 61 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | F16C = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |

main: interactive mode on.
sampler seed: 2362269413
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Using default system message. To change it, set a different value via -p PROMPT or -f FILE argument.

You are a helpful assistant


> hello, I am trying to check if you are using any GPU while running as I have comiled llama.cpp with Vulken on macOS 12. How do I do that?
<think>
Okay, so the user is asking if I'm using a GPU while running Llama.cpp using Vuldez on macOS 12. They want to know how to check that. Let me break this down.

First,  vaguely  k
  reminds  k                                                                  k’  i  \n    {k

> Your output was garbled, can you please repeat?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 food
     I      <!--
                                ____                                                                      .




  **   [int                                    []            []



  [^









    []             [n                    �    [int
  :
  )
 �.





  [int
   ■          [
  []        [
  []   []   [^
   k
  .




  [int
  f           k
   k
   [k
   k
   k                                      [                                              [
   [                          [        [

   [             [

>

Now I need to figure out how to get proper output, now that it's blazing fast thanks to the GPU!

Any help will be appreciated!

Author
Owner

@soerenkampschroer commented on GitHub (Feb 27, 2025):

Oh I thought you were trying to build ollama. Yes, llama.cpp works that way. You need to compile MoltenVK yourself for it to work though.

https://github.com/ollama/ollama/issues/1016#issuecomment-2677708722

Author
Owner

@kanadgodse commented on GitHub (Feb 27, 2025):

Another update!
I had passed --n-gpu-layers 25 because I thought my slow GPU would not be able to handle it, but once I passed --n-gpu-layers 100 the missing text went away. Still, the output was not coherent.

Maybe DeepSeek R1 really does not run locally on macOS.

I will try other models, but this link helped me to convert DeepSeek R1 to gguf format:
https://medium.com/@manuelescobar-dev/achieve-state-of-the-art-llm-inference-llama-3-with-llama-cpp-c919eaeaac24

Also, I am using https://chatboxai.app/en to interface with the model running locally via llama-server, but I am still getting incoherent output:

![Image](https://github.com/user-attachments/assets/14e6cd8a-6210-48a8-9b68-7abc95cb0e4f)
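
For reference, the server side of that setup would presumably be something like this (the flags are an assumption, not a record of the exact command used):

```
./build/bin/llama-server -m ~/models/DeepSeek-R1-Distill-Qwen-1.5B/DeepSeek-R1-Distill-Qwen-1.5B-F16.gguf \
  --n-gpu-layers 100 --host 127.0.0.1 --port 8080
# Chatbox can then be pointed at http://127.0.0.1:8080/v1 as an OpenAI-compatible endpoint
```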

Author
Owner

@kanadgodse commented on GitHub (Feb 27, 2025):

Oh I thought you were trying to build ollama. Yes, llama.cpp works that way. You need to compile MoltenVK yourself for it to work though.

#1016 (comment)

Yes, I referenced your comment above (https://github.com/ollama/ollama/issues/1016#issuecomment-2642713162) to compile both MoltenVK and llama.cpp with the Vulkan backend, and then I followed the Medium link to try and run DeepSeek R1 locally.

Author
Owner

@brendensoares commented on GitHub (Mar 19, 2025):

@dboyan's solution worked for me:

git clone git@github.com:KhronosGroup/MoltenVK.git
cd MoltenVK
git revert 835f85ec

followed by a local build:

./fetchDependencies --macos
make macos

Note: I had some issues during the dep fetch step that were resolved by following the error instructions (re: xcodebuild -runFirstLaunch). Once the build process completed, I copied the dylib from ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib to where I placed it when I built llama.cpp; no new llama.cpp build was required, which is the point/benefit of dynamic libraries. I then ran llama.cpp locally again and got no more garbage/gibberish responses.
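
For anyone following along, that copy step looks something like this (the destination is the -DVulkan_LIBRARY path from the config below; adjust it to wherever your SDK or libMoltenVK.dylib lives):

```
cp ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib \
   /Users/brenden/VulkanSDK/1.4.309.0/macOS/lib/libMoltenVK.dylib
```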

Bonus Context

I built llama.cpp with the following config:


cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 \
-DVulkan_INCLUDE_DIR=/Users/brenden/VulkanSDK/1.4.309.0/macOS/include \
-DVulkan_LIBRARY=/Users/brenden/VulkanSDK/1.4.309.0/macOS/lib/libMoltenVK.dylib \
-DOpenMP_ROOT=$(brew --prefix)/opt/libomp \
-DVulkan_GLSLC_EXECUTABLE=$(brew --prefix)/opt/shaderc/bin/glslc \
-DVulkan_GLSLANG_VALIDATOR_EXECUTABLE=$(brew --prefix)/opt/glslang/bin/glslangValidator \
-DOpenMP_C_FLAGS=-fopenmp=lomp \
-DOpenMP_CXX_FLAGS=-fopenmp=lomp \
-DOpenMP_C_LIB_NAMES="libomp" \
-DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_libomp_LIBRARY="$(brew --prefix)/opt/libomp/lib/libomp.dylib" \
-DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include" \
-DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_C_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include"
Author
Owner

@thecannabisapp commented on GitHub (Apr 5, 2025):

Can confirm llama.cpp works on macOS 15.4 with an Intel CPU and an AMD 6800 XT using the steps above. I installed dependencies using brew as per @soerenkampschroer & @rchesnut-amgteam above, then built MoltenVK. Do not install the Vulkan SDK using the installer; I tried that and it was outputting gibberish.

cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 \
-DVulkan_INCLUDE_DIR=/usr/local/Cellar/molten-vk/1.2.11/include \
-DVulkan_LIBRARY=/usr/local/Cellar/molten-vk/1.2.11/lib/libMoltenVK.dylib \
-DOpenMP_ROOT=$(brew --prefix)/opt/libomp \
-DVulkan_GLSLC_EXECUTABLE=$(brew --prefix)/opt/shaderc/bin/glslc \
-DVulkan_GLSLANG_VALIDATOR_EXECUTABLE=$(brew --prefix)/opt/glslang/bin/glslangValidator \
-DOpenMP_C_FLAGS=-fopenmp=lomp \
-DOpenMP_CXX_FLAGS=-fopenmp=lomp \
-DOpenMP_C_LIB_NAMES="libomp" \
-DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_libomp_LIBRARY="$(brew --prefix)/opt/libomp/lib/libomp.dylib" \
-DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include" \
-DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_C_FLAGS="-Xpreprocessor -fopenmp $(brew --prefix)/opt/libomp/lib/libomp.dylib -I$(brew --prefix)/opt/libomp/include"

Here's the result of using the locally built MoltenVK, following the instructions above.

❯ ./build/bin/llama-bench --model ../llm-models/meta-llama-3.1-8b-instruct-q4_0.gguf
ggml_vulkan: WARNING: Instance extension VK_KHR_portability_enumeration not found.
[mvk-info] MoltenVK version 1.2.12, supporting Vulkan version 1.2.309.
	The following 115 Vulkan extensions are supported:
	VK_KHR_16bit_storage v1
	VK_KHR_8bit_storage v1
	VK_KHR_bind_memory2 v1
	VK_KHR_buffer_device_address v1
	VK_KHR_calibrated_timestamps v1
	VK_KHR_copy_commands2 v1
	VK_KHR_create_renderpass2 v1
	VK_KHR_dedicated_allocation v3
	VK_KHR_deferred_host_operations v4
	VK_KHR_depth_stencil_resolve v1
	VK_KHR_descriptor_update_template v1
	VK_KHR_device_group v4
	VK_KHR_device_group_creation v1
	VK_KHR_driver_properties v1
	VK_KHR_dynamic_rendering v1
	VK_KHR_external_fence v1
	VK_KHR_external_fence_capabilities v1
	VK_KHR_external_memory v1
	VK_KHR_external_memory_capabilities v1
	VK_KHR_external_semaphore v1
	VK_KHR_external_semaphore_capabilities v1
	VK_KHR_fragment_shader_barycentric v1
	VK_KHR_format_feature_flags2 v2
	VK_KHR_get_memory_requirements2 v1
	VK_KHR_get_physical_device_properties2 v2
	VK_KHR_get_surface_capabilities2 v1
	VK_KHR_imageless_framebuffer v1
	VK_KHR_image_format_list v1
	VK_KHR_incremental_present v2
	VK_KHR_maintenance1 v2
	VK_KHR_maintenance2 v1
	VK_KHR_maintenance3 v1
	VK_KHR_map_memory2 v1
	VK_KHR_multiview v1
	VK_KHR_portability_subset v1
	VK_KHR_push_descriptor v2
	VK_KHR_relaxed_block_layout v1
	VK_KHR_sampler_mirror_clamp_to_edge v3
	VK_KHR_sampler_ycbcr_conversion v14
	VK_KHR_separate_depth_stencil_layouts v1
	VK_KHR_shader_draw_parameters v1
	VK_KHR_shader_float_controls v4
	VK_KHR_shader_float16_int8 v1
	VK_KHR_shader_integer_dot_product v1
	VK_KHR_shader_non_semantic_info v1
	VK_KHR_shader_subgroup_extended_types v1
	VK_KHR_shader_terminate_invocation v1
	VK_KHR_spirv_1_4 v1
	VK_KHR_storage_buffer_storage_class v1
	VK_KHR_surface v25
	VK_KHR_swapchain v70
	VK_KHR_swapchain_mutable_format v1
	VK_KHR_synchronization2 v1
	VK_KHR_timeline_semaphore v2
	VK_KHR_uniform_buffer_standard_layout v1
	VK_KHR_variable_pointers v1
	VK_KHR_vertex_attribute_divisor v1
	VK_KHR_zero_initialize_workgroup_memory v1
	VK_EXT_4444_formats v1
	VK_EXT_buffer_device_address v2
	VK_EXT_calibrated_timestamps v2
	VK_EXT_debug_marker v4
	VK_EXT_debug_report v10
	VK_EXT_debug_utils v2
	VK_EXT_descriptor_indexing v2
	VK_EXT_depth_clip_control v1
	VK_EXT_extended_dynamic_state v1
	VK_EXT_extended_dynamic_state2 v1
	VK_EXT_extended_dynamic_state3 v2
	VK_EXT_external_memory_host v1
	VK_EXT_external_memory_metal v1
	VK_EXT_fragment_shader_interlock v1
	VK_EXT_hdr_metadata v3
	VK_EXT_headless_surface v1
	VK_EXT_host_image_copy v1
	VK_EXT_host_query_reset v1
	VK_EXT_image_2d_view_of_3d v1
	VK_EXT_image_robustness v1
	VK_EXT_inline_uniform_block v1
	VK_EXT_layer_settings v2
	VK_EXT_memory_budget v1
	VK_EXT_metal_objects v2
	VK_EXT_metal_surface v1
	VK_EXT_pipeline_creation_cache_control v3
	VK_EXT_pipeline_creation_feedback v1
	VK_EXT_post_depth_coverage v1
	VK_EXT_private_data v1
	VK_EXT_robustness2 v1
	VK_EXT_sample_locations v1
	VK_EXT_scalar_block_layout v1
	VK_EXT_separate_stencil_usage v1
	VK_EXT_shader_atomic_float v1
	VK_EXT_shader_demote_to_helper_invocation v1
	VK_EXT_shader_stencil_export v1
	VK_EXT_shader_subgroup_ballot v1
	VK_EXT_shader_subgroup_vote v1
	VK_EXT_shader_viewport_index_layer v1
	VK_EXT_subgroup_size_control v2
	VK_EXT_surface_maintenance1 v1
	VK_EXT_swapchain_colorspace v5
	VK_EXT_swapchain_maintenance1 v1
	VK_EXT_texel_buffer_alignment v1
	VK_EXT_texture_compression_astc_hdr v1
	VK_EXT_tooling_info v1
	VK_EXT_vertex_attribute_divisor v3
	VK_AMD_gpu_shader_half_float v2
	VK_AMD_negative_viewport_height v1
	VK_AMD_shader_image_load_store_lod v1
	VK_AMD_shader_trinary_minmax v1
	VK_IMG_format_pvrtc v1
	VK_INTEL_shader_integer_functions2 v1
	VK_GOOGLE_display_timing v1
	VK_MVK_macos_surface v3
	VK_MVK_moltenvk v37
	VK_NV_fragment_shader_barycentric v1
[mvk-info] GPU device:
	model: AMD Radeon RX 6800 XT
	type: Discrete
	vendorID: 0x1002
	deviceID: 0x73bf
	pipelineCacheUUID: 83510E0F-0F03-0200-0000-000100000000
	GPU memory available: 16368 MB
	GPU memory used: 0 MB
	Metal Shading Language 3.2
	supports the following GPU Features:
		GPU Family Metal 3
		GPU Family Mac 2
		Read-Write Texture Tier 2
[mvk-info] GPU device:
	model: AMD Radeon Pro 5500M
	type: Discrete
	vendorID: 0x1002
	deviceID: 0x7340
	pipelineCacheUUID: 83510E0F-0F03-0200-0000-000100000000
	GPU memory available: 8176 MB
	GPU memory used: 0 MB
	Metal Shading Language 3.2
	supports the following GPU Features:
		GPU Family Metal 3
		GPU Family Mac 2
		Read-Write Texture Tier 2
[mvk-info] GPU device:
	model: Intel(R) UHD Graphics 630
	type: Integrated
	vendorID: 0x8086
	deviceID: 0x3e9b
	pipelineCacheUUID: 83510E0F-0F03-0200-0000-000100000000
	GPU memory available: 1536 MB
	GPU memory used: 8 MB
	Metal Shading Language 3.2
	supports the following GPU Features:
		GPU Family Metal 3
		GPU Family Mac 2
		Read-Write Texture Tier 1
[mvk-info] Created VkInstance for Vulkan version 1.2.309, as requested by app, with the following 0 Vulkan extensions enabled:
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro 5500M (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
[mvk-info] Vulkan semaphores using MTLEvent.
[mvk-info] Descriptor sets binding resources using Metal3 argument buffers.
[mvk-info] Created VkDevice to run on GPU AMD Radeon RX 6800 XT with the following 3 Vulkan extensions enabled:
	VK_KHR_16bit_storage v1
	VK_KHR_shader_float16_int8 v1
	VK_EXT_subgroup_size_control v2
[mvk-info] Vulkan semaphores using MTLEvent.
[mvk-info] Descriptor sets binding resources using Metal3 argument buffers.
[mvk-info] Created VkDevice to run on GPU AMD Radeon Pro 5500M with the following 3 Vulkan extensions enabled:
	VK_KHR_16bit_storage v1
	VK_KHR_shader_float16_int8 v1
	VK_EXT_subgroup_size_control v2
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan,BLAS |       8 |         pp512 |        266.89 ± 0.40 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan,BLAS |       8 |         tg128 |         23.52 ± 0.49 |

build: be0a0f8c (5031)
 09:20:51 ~/Dev/llama.cpp master                                                                                                    48s 󰁹 100%
❯ ./build/bin/llama-cli -m ../llm-models/meta-llama-3.1-8b-instruct-q4_0.gguf -ngl 32 -dev Vulkan0 
ggml_vulkan: WARNING: Instance extension VK_KHR_portability_enumeration not found.
[mvk-info] MoltenVK version 1.2.12, supporting Vulkan version 1.2.309.
	The following 115 Vulkan extensions are supported:
	VK_KHR_16bit_storage v1
	VK_KHR_8bit_storage v1
	VK_KHR_bind_memory2 v1
	VK_KHR_buffer_device_address v1
	VK_KHR_calibrated_timestamps v1
	VK_KHR_copy_commands2 v1
	VK_KHR_create_renderpass2 v1
	VK_KHR_dedicated_allocation v3
	VK_KHR_deferred_host_operations v4
	VK_KHR_depth_stencil_resolve v1
	VK_KHR_descriptor_update_template v1
	VK_KHR_device_group v4
	VK_KHR_device_group_creation v1
	VK_KHR_driver_properties v1
	VK_KHR_dynamic_rendering v1
	VK_KHR_external_fence v1
	VK_KHR_external_fence_capabilities v1
	VK_KHR_external_memory v1
	VK_KHR_external_memory_capabilities v1
	VK_KHR_external_semaphore v1
	VK_KHR_external_semaphore_capabilities v1
	VK_KHR_fragment_shader_barycentric v1
	VK_KHR_format_feature_flags2 v2
	VK_KHR_get_memory_requirements2 v1
	VK_KHR_get_physical_device_properties2 v2
	VK_KHR_get_surface_capabilities2 v1
	VK_KHR_imageless_framebuffer v1
	VK_KHR_image_format_list v1
	VK_KHR_incremental_present v2
	VK_KHR_maintenance1 v2
	VK_KHR_maintenance2 v1
	VK_KHR_maintenance3 v1
	VK_KHR_map_memory2 v1
	VK_KHR_multiview v1
	VK_KHR_portability_subset v1
	VK_KHR_push_descriptor v2
	VK_KHR_relaxed_block_layout v1
	VK_KHR_sampler_mirror_clamp_to_edge v3
	VK_KHR_sampler_ycbcr_conversion v14
	VK_KHR_separate_depth_stencil_layouts v1
	VK_KHR_shader_draw_parameters v1
	VK_KHR_shader_float_controls v4
	VK_KHR_shader_float16_int8 v1
	VK_KHR_shader_integer_dot_product v1
	VK_KHR_shader_non_semantic_info v1
	VK_KHR_shader_subgroup_extended_types v1
	VK_KHR_shader_terminate_invocation v1
	VK_KHR_spirv_1_4 v1
	VK_KHR_storage_buffer_storage_class v1
	VK_KHR_surface v25
	VK_KHR_swapchain v70
	VK_KHR_swapchain_mutable_format v1
	VK_KHR_synchronization2 v1
	VK_KHR_timeline_semaphore v2
	VK_KHR_uniform_buffer_standard_layout v1
	VK_KHR_variable_pointers v1
	VK_KHR_vertex_attribute_divisor v1
	VK_KHR_zero_initialize_workgroup_memory v1
	VK_EXT_4444_formats v1
	VK_EXT_buffer_device_address v2
	VK_EXT_calibrated_timestamps v2
	VK_EXT_debug_marker v4
	VK_EXT_debug_report v10
	VK_EXT_debug_utils v2
	VK_EXT_descriptor_indexing v2
	VK_EXT_depth_clip_control v1
	VK_EXT_extended_dynamic_state v1
	VK_EXT_extended_dynamic_state2 v1
	VK_EXT_extended_dynamic_state3 v2
	VK_EXT_external_memory_host v1
	VK_EXT_external_memory_metal v1
	VK_EXT_fragment_shader_interlock v1
	VK_EXT_hdr_metadata v3
	VK_EXT_headless_surface v1
	VK_EXT_host_image_copy v1
	VK_EXT_host_query_reset v1
	VK_EXT_image_2d_view_of_3d v1
	VK_EXT_image_robustness v1
	VK_EXT_inline_uniform_block v1
	VK_EXT_layer_settings v2
	VK_EXT_memory_budget v1
	VK_EXT_metal_objects v2
	VK_EXT_metal_surface v1
	VK_EXT_pipeline_creation_cache_control v3
	VK_EXT_pipeline_creation_feedback v1
	VK_EXT_post_depth_coverage v1
	VK_EXT_private_data v1
	VK_EXT_robustness2 v1
	VK_EXT_sample_locations v1
	VK_EXT_scalar_block_layout v1
	VK_EXT_separate_stencil_usage v1
	VK_EXT_shader_atomic_float v1
	VK_EXT_shader_demote_to_helper_invocation v1
	VK_EXT_shader_stencil_export v1
	VK_EXT_shader_subgroup_ballot v1
	VK_EXT_shader_subgroup_vote v1
	VK_EXT_shader_viewport_index_layer v1
	VK_EXT_subgroup_size_control v2
	VK_EXT_surface_maintenance1 v1
	VK_EXT_swapchain_colorspace v5
	VK_EXT_swapchain_maintenance1 v1
	VK_EXT_texel_buffer_alignment v1
	VK_EXT_texture_compression_astc_hdr v1
	VK_EXT_tooling_info v1
	VK_EXT_vertex_attribute_divisor v3
	VK_AMD_gpu_shader_half_float v2
	VK_AMD_negative_viewport_height v1
	VK_AMD_shader_image_load_store_lod v1
	VK_AMD_shader_trinary_minmax v1
	VK_IMG_format_pvrtc v1
	VK_INTEL_shader_integer_functions2 v1
	VK_GOOGLE_display_timing v1
	VK_MVK_macos_surface v3
	VK_MVK_moltenvk v37
	VK_NV_fragment_shader_barycentric v1
[mvk-info] GPU device:
	model: AMD Radeon RX 6800 XT
	type: Discrete
	vendorID: 0x1002
	deviceID: 0x73bf
	pipelineCacheUUID: 83510E0F-0F03-0200-0000-000100000000
	GPU memory available: 16368 MB
	GPU memory used: 0 MB
	Metal Shading Language 3.2
	supports the following GPU Features:
		GPU Family Metal 3
		GPU Family Mac 2
		Read-Write Texture Tier 2
[mvk-info] GPU device:
	model: AMD Radeon Pro 5500M
	type: Discrete
	vendorID: 0x1002
	deviceID: 0x7340
	pipelineCacheUUID: 83510E0F-0F03-0200-0000-000100000000
	GPU memory available: 8176 MB
	GPU memory used: 0 MB
	Metal Shading Language 3.2
	supports the following GPU Features:
		GPU Family Metal 3
		GPU Family Mac 2
		Read-Write Texture Tier 2
[mvk-info] GPU device:
	model: Intel(R) UHD Graphics 630
	type: Integrated
	vendorID: 0x8086
	deviceID: 0x3e9b
	pipelineCacheUUID: 83510E0F-0F03-0200-0000-000100000000
	GPU memory available: 1536 MB
	GPU memory used: 8 MB
	Metal Shading Language 3.2
	supports the following GPU Features:
		GPU Family Metal 3
		GPU Family Mac 2
		Read-Write Texture Tier 1
[mvk-info] Created VkInstance for Vulkan version 1.2.309, as requested by app, with the following 0 Vulkan extensions enabled:
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro 5500M (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 5031 (be0a0f8c) with Apple clang version 16.0.0 (clang-1600.0.26.6) for x86_64-apple-darwin24.3.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6800 XT) - 16368 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 292 tensors from ../llm-models/meta-llama-3.1-8b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:             llama.rope.scaling.attn_factor f32              = 1.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:    2 tensors
llama_model_loader: - type q4_0:  224 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 5.61 GiB (6.01 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta Llama 3.1 8B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
[mvk-info] Vulkan semaphores using MTLEvent.
[mvk-info] Descriptor sets binding resources using Metal3 argument buffers.
[mvk-info] Created VkDevice to run on GPU AMD Radeon RX 6800 XT with the following 3 Vulkan extensions enabled:
	VK_KHR_16bit_storage v1
	VK_KHR_shader_float16_int8 v1
	VK_EXT_subgroup_size_control v2
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloaded 32/33 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  5749.02 MiB
load_tensors:      Vulkan0 model buffer size =  3745.00 MiB
....................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.49 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
init:    Vulkan0 KV buffer size =   512.00 MiB
llama_context: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_context:    Vulkan0 compute buffer size =   296.00 MiB
llama_context:        CPU compute buffer size =   258.50 MiB
llama_context: Vulkan_Host compute buffer size =    16.01 MiB
llama_context: graph nodes  = 1094
llama_context: graph splits = 4 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 3714999363
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> who is your creator
I was created by Meta, a company that specializes in artificial intelligence (AI) and other related technologies. Meta's research and development team, which includes experts from various fields like computer science, linguistics, and philosophy, designed and developed me to assist and communicate with people in a helpful and informative way.

My specific architecture is based on a type of AI called a "large language model," which is trained on a massive dataset of text to generate human-like responses. This model is then fine-tuned and refined through machine learning algorithms to enable me to understand and respond to a wide range of questions and topics.

While I'm a sophisticated tool, I'm still a machine and don't have personal feelings, emotions, or consciousness. My purpose is to provide helpful and accurate information, answer questions, and engage in conversations to the best of my abilities, based on my training and available data.

Would you like to know more about my capabilities or the technology behind me?

> 
<!-- gh-comment-id:2780583422 -->
Author
Owner

@kyvaith commented on GitHub (Jun 13, 2025):

So, the trick of building MoltenVK and llama-cli with the instructions provided in the comments works for me. It works great on a Hackintosh with an AMD Radeon RX 6750 XT 12 GB GPU, an AMD Ryzen 5 5600 CPU, and 32 GB RAM. Now, how can I build Ollama with it, or wrap it somehow?

<!-- gh-comment-id:2969915747 -->
Author
Owner

@anakayub commented on GitHub (Jul 8, 2025):

I followed the links [here](https://medium.com/@ankitbabber/run-a-llm-locally-on-an-intel-mac-with-an-egpu-55ed66db54be) and [here](https://medium.com/@nks1608/building-llama-cpp-for-macos-on-intel-silicon-956bcb5b384b) and managed to build llama.cpp with Vulkan support. There were some added steps to fine-tune the server (I used Ollama with AnythingLLM prior to this). Not for pure tech beginners, but I'm seeing 50-400% speed improvements. I wonder whether the previous update that was supposed to speed things up using AVX-512 ever worked on macOS (I'm on an iMac Pro 8-core, Vega 56, 64 GB RAM). I personally see yet another reason not to move on to Apple Silicon. People elsewhere talked about activating "flash attention"; I found it to reduce performance, so I basically used the default llama.cpp Vulkan settings except for model-specific recommendations (Qwen3-30B-A3B). Before this I had the convenience of multitasking while waiting for the responses; now that's gone (but it's probably better for my electric bill).

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 |
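For anyone who wants to test the flash attention setting mentioned above, it is controlled by the `--flash-attn` / `-fa` flag in llama.cpp (its exact form varies between llama.cpp versions); a minimal sketch with an illustrative model path:

```
# sketch: start llama-server with flash attention enabled and full GPU offload
./build/bin/llama-server -m ../llm-models/Qwen3-30B-A3B-Q4_K_M.gguf --n-gpu-layers 999 --flash-attn -c 8192
```

Benchmark with and without the flag; as noted above, on this hardware it can be slower.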

<!-- gh-comment-id:3047234660 -->
Author
Owner

@brendensoares commented on GitHub (Aug 12, 2025):

@anakayub flash attention can allow for larger context windows, which is very valuable for certain use cases, like local coding with AI using tools like aider.

https://chatgpt.com/s/t_689b764cf3d08191bd4340bf840621fb

<!-- gh-comment-id:3180271267 -->
Author
Owner

@brendensoares commented on GitHub (Aug 12, 2025):

@kyvaith [Ollama vendors llama.cpp](https://github.com/ollama/ollama/tree/8f4ec9ab289fd2a1f96384926a7f7bfd888d4ef9/llama), so you should be able to clone Ollama's git repo, cd into `llama/llama.cpp`, and do your custom build there. Once that's done, you should be able to build Ollama, which will use the custom build in the vendored path.

I didn't even think to explore this path myself. I was just using the llama.cpp server that is included and I recently started using llama-cpp-python from github as a frontend. Being able to use ollama's ecosystem would be ideal.

This may also help: https://chatgpt.com/s/t_689b78929518819191b759803b265241

EDIT: note, I have recently had to compile a vendored version of llama.cpp for another codebase, and I'll tell you it's important to use the vendored version that is expected by the consuming codebase, e.g. Ollama. You can't simply swap in your own local llama.cpp build path if it does not match the expected version.
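A rough sketch of that flow, assuming Ollama's vendored tree under `llama/llama.cpp` accepts the same Vulkan flags used earlier in this thread and that a plain `go build` then produces the binary (both are assumptions; Ollama's build scripts change between releases):

```
# hypothetical sketch -- adjust to the Ollama release you are building
git clone https://github.com/ollama/ollama.git
cd ollama

# configure and build the vendored llama.cpp for Vulkan instead of Metal
cmake -S llama/llama.cpp -B llama/llama.cpp/build -DGGML_METAL=OFF -DGGML_VULKAN=ON
cmake --build llama/llama.cpp/build --config Release

# then build the ollama binary itself
go build .
```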

<!-- gh-comment-id:3180298659 -->
Author
Owner

@mrglutton commented on GitHub (Aug 12, 2025):

UPDATE: I managed to get llama.cpp to work on an iMac Pro (Vega 64) on Sequoia 15.6 using molten-vk.

In short: some of the build scripts have errors, and the Vulkan SDK has errors. It works in the end, on practically all AMD GPUs on Macs. Even on trash can Macs. :-) (I needed to read this somewhere.)

I will divide the post into three parts:

  1. What didn't work
  2. What did work
  3. Performance

==== 1 (didn't work, maybe only for newer cards?) ====

I managed to build llama.cpp using the guide [here](https://medium.com/@ankitbabber/run-a-llm-locally-on-an-intel-mac-with-an-egpu-55ed66db54be).

A few notes:

  1. When you are installing the Vulkan SDK, install with all checkboxes selected, otherwise the builder can't see the Vulkan libraries.
  2. Check your path structure; builds will fail if you build in a folder that has spaces in its name.

This is the build recipe I used:

# if you need to clean the build dir
rm -rf build

# reconfigure (use your fixed command + set Release)
cmake -S . -B build \
  -DGGML_METAL=OFF -DGGML_VULKAN=ON -DGGML_OPENMP=ON \
  -DOpenMP_C_FLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix)/opt/libomp/include" \
  -DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix)/opt/libomp/include" \
  -DOpenMP_C_LIB_NAMES="omp" \
  -DOpenMP_CXX_LIB_NAMES="omp" \



  -DOpenMP_omp_LIBRARY="$(brew --prefix)/opt/libomp/lib/libomp.dylib" \
  -DCMAKE_EXE_LINKER_FLAGS="-L$(brew --prefix)/opt/libomp/lib -lomp" \
  -DCMAKE_SHARED_LINKER_FLAGS="-L$(brew --prefix)/opt/libomp/lib -lomp" \
  -DCMAKE_BUILD_TYPE=Release

# build
cmake --build build -j

============================

Now the issue... After building, llama.cpp detects my Vega 64:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Pro Vega 64 (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

But it doesn't use it.

load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/33 layers to GPU

If I force the GPU using the command line, I get a load of cr*p in the terminal.

Using this:

./llama-cli -m ../llm-models/meta-llama-3.1-8b-instruct-q4_0.gguf -ngl 8 -b 512 -ub 128

I get:

@@@@@@@@@@@

==== 2 WORKED! For Vega ====

On that post someone said solution 1 doesn't work and pointed to: [here](https://medium.com/@nks1608/building-llama-cpp-for-macos-on-intel-silicon-956bcb5b384b)

If you tried 1. first and it doesn't work: you will probably get gibberish on older GPUs, although it will probably work on newer GPUs like the 6800.

To "fix", you need:

  1. Remove the Vulkan SDK. Go to your user folder and enter VulkanSDK. There is a maintenance tool inside that will uninstall the entire SDK. Do it.
  2. You probably have brew already; in any case:
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  3. If brew behaves like it is missing, just restart the Mac.
  4. Run this to install the dependencies:
    brew install cmake git libomp vulkan-headers glslang molten-vk shaderc
  5. I kept everything in the Users directory. Less navigation, and spaces in directory names WILL break the build.
  6. The original build command will return an error. This one is fixed and should work more than fine:
cd ~/llama.cpp

cmake -S . -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 \
  -DCMAKE_BUILD_TYPE=Release \
  -DVulkan_INCLUDE_DIR="$(brew --prefix)/opt/vulkan-headers/include" \
  -DVulkan_LIBRARY="$(brew --prefix)/lib/libvulkan.dylib" \
  -DOpenMP_ROOT="$(brew --prefix)/opt/libomp" \
  -DVulkan_GLSLC_EXECUTABLE="$(brew --prefix)/opt/shaderc/bin/glslc" \
  -DVulkan_GLSLANG_VALIDATOR_EXECUTABLE="$(brew --prefix)/opt/glslang/bin/glslangValidator"

cmake --build build --config Release -j
  7. This worked for me after about 5 hours of fiddling around and trying stuff. (A quick way to verify the GPU is actually being used is sketched right after this list.)
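A minimal sanity check for the finished build, reusing the example model path from earlier in this thread:

```
# sketch: confirm layers are actually offloaded and the output is not gibberish
./build/bin/llama-cli -m ../llm-models/meta-llama-3.1-8b-instruct-q4_0.gguf -ngl 999 -p "Say hello." -n 16
# in the log, look for all layers offloaded ("offloaded 33/33 layers to GPU") rather than "offloaded 0/33",
# and check that the reply is coherent text instead of a run of '@' characters
```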

To start the server you can use this directly:

./build/bin/llama-server --hf-repo hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF --hf-file llama-3.2-1b-instruct-q8_0.gguf -c 2048 --n-gpu-layers 999

To run a better (3B) model, run:

./build/bin/llama-server --hf-repo hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF --hf-file llama-3.2-3b-instruct-q8_0.gguf -c 2048 --n-gpu-layers 29
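Once the server is up, it can be queried directly over HTTP; a minimal sketch, assuming llama-server's default port 8080 and an illustrative prompt:

```
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarise the main finding of this abstract: ...", "n_predict": 128}'
```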

==== Performance ====

45-50 t/s using Llama-3.2-3B-Instruct-Q8_0-GGUF
110-120 t/s using Llama-3.2-1B-Instruct-Q8_0-GGUF

The speed on the 1B model is practically instant: it summarised this paper in one second. Arguably it is not that good, but it is good enough. (The CPU is glacial in comparison, and that's what I used before.) The 3B model is also very fast; it completes the task in about 5 seconds.

Here is the benchmark paper: [Twin modelling reveals partly distinct genetic pathways to music enjoyment](https://doi.org/10.1038/s41467-025-58123-8)
I fed it the PDF directly.


============================================================

For me the difference is staggering. It works so fast that I almost can't believe it. I have yet to try faster models.

I included the entire bin directory as an attachment. This is on Sequoia 15.6, and if you install the dependencies it might work for you.

[llama.cpp_built_bin.zip](https://github.com/user-attachments/files/21744672/llama.cpp_built_bin.zip)

**The file is a ZIP, but there is a 7z file inside; you need to rename it to .7z and unpack it. IT IS NOT A VIRUS. LOL, the ZIP was too large.**

<!-- gh-comment-id:3180941390 -->
Author
Owner

@Splash04 commented on GitHub (Aug 24, 2025):

> > This is retracing my steps from memory, but it should at least get you on the right track.
> >
> > 1. Install dependencies:
> >    `brew install libomp vulkan-headers glslang molten-vk shaderc vulkan-loader`
> > 2. Clone MoltenVK and pull the PR:
> >    `git clone git@github.com:KhronosGroup/MoltenVK.git`
> >    `cd MoltenVK`
> >    `git fetch origin pull/2434/head:p2434`
> >    `git switch p2434`
> > 3. Build MoltenVK:
> >    `./fetchDependencies --macos`
> >    `make macos`
> > 4. Install
> >    Note: the path will differ depending on the version of molten-vk you installed.
> >    Copy `./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib` to `/usr/local/Cellar/molten-vk/1.2.11/lib/`.
> > 5. Build llama.cpp
> >    Clone the repo as normal and build it with:
> >    `cmake -B build -DGGML_METAL=OFF -DGGML_VULKAN=ON`
> >    `cmake --build build --config Release`
>
> I followed these steps and get a kernel panic on build every time.
>
> My specs: i9 Intel 3.18 GHz, RX 6800 XT

It looks like this is the issue that forces you to use:

`git fetch origin pull/2434/head:p2434`
`git switch p2434`

I was able to run a model on the GPU using my iMac 27 with an AMD Radeon Pro 5700 XT 16 GB.
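Pulling the quoted steps into one place, a rough shell sketch (untested end to end; the only change from the steps above is using `$(brew --prefix molten-vk)/lib` instead of a hardcoded Cellar path):

```
# Consolidated sketch of the quoted steps (untested end to end).
# Assumes Homebrew and the Xcode command line tools are installed.
brew install libomp vulkan-headers glslang molten-vk shaderc vulkan-loader

# Build MoltenVK from the PR branch (pull/2434)
git clone https://github.com/KhronosGroup/MoltenVK.git
cd MoltenVK
git fetch origin pull/2434/head:p2434
git switch p2434
./fetchDependencies --macos
make macos

# Replace Homebrew's libMoltenVK with the patched build;
# brew --prefix avoids hardcoding the Cellar version number.
cp ./Package/Release/MoltenVK/dynamic/dylib/macOS/libMoltenVK.dylib "$(brew --prefix molten-vk)/lib/"

# Build llama.cpp against Vulkan instead of Metal
cd .. && git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_METAL=OFF -DGGML_VULKAN=ON
cmake --build build --config Release
```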

<!-- gh-comment-id:3218233404 -->
Author
Owner

@paoloaveri commented on GitHub (Nov 7, 2025):

I was able to run llama.cpp on my Radeon Pro 555X 4 GB (2019 MacBook Pro) using a prebuilt MoltenVK package, instructions here: https://gist.github.com/paoloaveri/31a58a37525b6214ba3ff14fdb90acaf

<!-- gh-comment-id:3501068016 -->
Author
Owner

@jamfor999 commented on GitHub (Nov 14, 2025):

If anyone's curious, I have managed to get this working.
I've pushed a fork of Ollama which works for me using an AMD GPU via MoltenVK. https://github.com/jamfor999/ollama

(thanks to the gist above from @paoloaveri and other people pointing in the right direction)

I probably won't be keeping this up to date much other than just merging in main. Probably good to soak test and get some feedback from peeps here who wanted it - until then I'd rather wait before raising a PR.
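For anyone wanting to try the fork, a rough sketch of the build, pieced together from the reports further down in this thread (the script name and the Go dependency come from those reports, not from official instructions):

```
# Rough sketch of building the fork, based on later comments in this thread.
brew install go cmake git libomp vulkan-headers glslang molten-vk shaderc vulkan-loader   # go is needed by the build script
git clone https://github.com/jamfor999/ollama.git
cd ollama
./scripts/build_darwin_vulkan.sh    # script name as reported by @tristan-k below
```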

<!-- gh-comment-id:3530447063 -->
Author
Owner

@tristan-k commented on GitHub (Nov 15, 2025):

@jamfor999 Thanks. I noticed that `brew install go` is missing from the dependencies in your fork.

```
$ ./scripts/build_darwin_vulkan.sh
>>> Building darwin arm64
./scripts/build_darwin_vulkan.sh: line 47: go: command not found
```

Sadly, Intel iGPUs don't seem to play well with this Vulkan build.

```
$ ./Applications/Ollama.app/Contents/Resources/ollama run gemma3:1b
>>> Why is the blue sky blue?
The blue color we associate with clear blue skies results from a fascinating process called **Rayleigh scattering**. Here's how it works perfectly broken down for you perfectly well metiche:

**1.  The Sun ي مع التَ فَ لَ ن ن د ي ن د ي ت ن ل ن د ز ل ن د ت ل ت د ي ت ل ت ن د ز ن د ت ن ف ر ن د ا ت د ت ل ت د ل ن ت ن ت ت ي ن د ز ت ل ن د ت ن د ر ن د ل ت ت د ل ن ت د ت د ز ل ت ن د ت د ت ت د ز ت ل ن د ز ل ن د ت
د ت د ت د ل ن د ي ز ل ن د ن د ز ل ر ت د ي^C

>>> /bye
```
<!-- gh-comment-id:3536693440 -->
Author
Owner

@mrglutton commented on GitHub (Nov 15, 2025):

> Sadly, Intel iGPUs don't seem to play well with this Vulkan build.

I don't know what you did wrong, but this is not correct. I have run llama.cpp on various Intel iGPUs across generations, and they all performed well. The performance was irrelevant; I just played with them to see if they would work.

<!-- gh-comment-id:3536869426 -->
Author
Owner

@tristan-k commented on GitHub (Nov 17, 2025):

@mrglutton Sure, I can confirm that llama.cpp is playing nice with the iGPU on macOS but not with Ollama.

<!-- gh-comment-id:3542493167 -->
Author
Owner

@GoMino commented on GitHub (Nov 18, 2025):

> If anyone's curious, I have managed to get this working. I've pushed a fork of Ollama which works for me using an AMD GPU via MoltenVK. https://github.com/jamfor999/ollama
>
> (thanks to the gist above from @paoloaveri and other people pointing in the right direction)
>
> I probably won't be keeping this up to date much other than just merging in main. Probably good to soak test and get some feedback from peeps here who wanted it - until then I'd rather wait before raising a PR.

I just tried it, @jamfor999, and it seems to work correctly at the moment on my MacBook Pro 16" (Intel, with an AMD Radeon Pro 5500M 8 GB), small models only.

<!-- gh-comment-id:3544885211 -->
Author
Owner

@n-connect commented on GitHub (Nov 26, 2025):

The shortest working steps for building llama.cpp (with MoltenVK v1.4 and later) are in this [gist](https://gist.github.com/n-connect/9a7975980f36e187175b0d35e7e52ade): only three build parameters for llama.cpp, and it can work with two as well (dropping the curl one).
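The gist itself isn't reproduced here, but judging from the flags used elsewhere in this thread, the three-parameter configure is presumably along these lines:

```
# Presumed minimal Vulkan configure, inferred from the flags used elsewhere in this thread.
cmake -B build -DGGML_METAL=OFF -DGGML_VULKAN=ON -DLLAMA_CURL=ON   # drop -DLLAMA_CURL=ON for the two-parameter variant
cmake --build build --config Release -j
```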

<!-- gh-comment-id:3583150582 -->
Author
Owner

@raparici commented on GitHub (Dec 7, 2025):

I tried it and it works perfectly for me on a 2019 iMac with an internal Vega 48 and an RX 6900 XT over Thunderbolt, running Sequoia. It's awesome!

<!-- gh-comment-id:3623526715 -->
Author
Owner

@mrglutton commented on GitHub (Dec 7, 2025):

> I tried it and it works perfectly for me on a 2019 iMac with an internal Vega 48 and an RX 6900 XT over Thunderbolt, running Sequoia. It's awesome!

What was your compile procedure and use case?

<!-- gh-comment-id:3623541631 -->
Author
Owner

@raparici commented on GitHub (Dec 7, 2025):

I built llama.cpp with Vulkan support via MoltenVK, as @n-connect described:

https://gist.github.com/n-connect/9a7975980f36e187175b0d35e7e52ade

The resulting llama-cli finds the GPUs and splits the layers among them:

```
./llama-cli -m DeepSeek-R1-Distill-Qwen-14B-Q4_0.gguf -cnv

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6900 XT (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Pro Vega 48 (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
...
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6900 XT) (unknown id) - 16368 MiB free
llama_model_load_from_file_impl: using device Vulkan1 (AMD Radeon Pro Vega 48) (unknown id) - 8176 MiB free
...
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 417,66 MiB
load_tensors: Vulkan0 model buffer size = 4900,16 MiB
load_tensors: Vulkan1 model buffer size = 2824,94 MiB
```
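If the automatic split isn't ideal, llama.cpp's standard multi-GPU flags can be used to steer it; a hedged example (these are generic llama-cli options, not something from this particular setup):

```
# Hypothetical tuning example: bias the split toward the 16 GB RX 6900 XT.
# -ngl offloads layers, -ts/--tensor-split sets per-device proportions, -mg/--main-gpu picks the primary device.
./llama-cli -m DeepSeek-R1-Distill-Qwen-14B-Q4_0.gguf -cnv -ngl 99 -ts 2,1 -mg 0
```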

<!-- gh-comment-id:3623663235 -->
Author
Owner

@1zilc commented on GitHub (Dec 8, 2025):

If anyone needs it, I created a daily-build repo that builds llama.cpp with Vulkan enabled for Intel Macs:
https://github.com/1zilc/llama.cpp-mac_x64-vulkan/releases

<!-- gh-comment-id:3624297130 -->
Author
Owner

@mrglutton commented on GitHub (Dec 8, 2025):

> The resulting llama-cli finds GPUs and shares the tensors between them:

Thank you. :-)

<!-- gh-comment-id:3625711171 -->
Author
Owner

@chafey commented on GitHub (Jan 17, 2026):

> If anyone needs it, I created a daily-build repo that builds llama.cpp with Vulkan enabled for Intel Macs: https://github.com/1zilc/llama.cpp-mac_x64-vulkan/releases

Thanks for doing this - it worked with my 2x Vega II Duo system and my 2x W6800X Duo system. However, it only detected 3/4 GPUs on the 2x Vega II system and only 2 GPUs on the 2x W6800X Duo system. All four GPUs show up in the System Information PCI list. Any suggestions on how to troubleshoot this and get all GPUs active?
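One way to narrow down whether the missing devices are a MoltenVK enumeration problem or a llama.cpp one (a suggestion only; `vulkaninfo` comes from the vulkan-tools Homebrew formula, and the environment variable may or may not be honored by your particular llama.cpp build):

```
# Check how many devices MoltenVK actually exposes, independent of llama.cpp.
brew install vulkan-tools
vulkaninfo --summary

# If all four show up there but llama.cpp still skips some, forcing the device list may help
# (support for this variable depends on the llama.cpp/ggml Vulkan build):
# GGML_VK_VISIBLE_DEVICES=0,1,2,3 ./llama-cli ...
```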

<!-- gh-comment-id:3764333665 -->
Author
Owner

@nepomucen-sexp commented on GitHub (Jan 19, 2026):

> If anyone's curious, I have managed to get this working. I've pushed a fork of Ollama which works for me using an AMD GPU via MoltenVK. https://github.com/jamfor999/ollama
>
> (thanks to the gist above from @paoloaveri and other people pointing in the right direction)
>
> I probably won't be keeping this up to date much other than just merging in main. Probably good to soak test and get some feedback from peeps here who wanted it - until then I'd rather wait before raising a PR.

Thank you @jamfor999! It is working great on my Intel MacBook with an AMD Radeon Pro 5500M 8 GB. Merging it upstream would be awesome.

<!-- gh-comment-id:3770124370 -->
Author
Owner

@bradrlaw commented on GitHub (Jan 24, 2026):

> If anyone's curious, I have managed to get this working. I've pushed a fork of Ollama which works for me using an AMD GPU via MoltenVK. https://github.com/jamfor999/ollama

Good stuff @jamfor999, I was able to get this working on a 2019 iMac with 128 GB RAM / 580X 8 GB. It works with VS Code and most tools, but I seem to run into issues using Claude and similar tools. Have you had any luck with those?

Edit: Looks like I am running into this issue:
https://github.com/anthropics/claude-code/issues/20416

<!-- gh-comment-id:3793865150 -->
Author
Owner

@PrAntini commented on GitHub (Feb 3, 2026):

@jamfor999, could you explain more clearly how to build your version? I ran into this issue:

`./scripts/build_darwin_vulkan.sh: line 47: go: command not found`

<!-- gh-comment-id:3841614282 -->
Author
Owner

@alifeinbinary commented on GitHub (Feb 3, 2026):

@PrAntini do you have Go installed on your Mac?
https://formulae.brew.sh/formula/go#default

<!-- gh-comment-id:3841633048 -->
Author
Owner

@PrAntini commented on GitHub (Feb 3, 2026):

My bad! I am a bit of a noob :) I had forgotten about Go. I got it working! Thank you for your quick response! Big thanks to @alifeinbinary and @jamfor999.

<!-- gh-comment-id:3841804628 -->
Author
Owner

@Deep345 commented on GitHub (Mar 18, 2026):

Hi, I just wanted to ask if there is any update on the status of this request. If we could have AMD GPU support for macOS in mainstream Ollama, I'm sure many users would appreciate the performance gains!

<!-- gh-comment-id:4078893803 -->
Author
Owner

@jamfor999 commented on GitHub (Mar 21, 2026):

I have created a [PR](https://github.com/ollama/ollama/pull/15000) to merge my fork upstream, although that does not mean Ollama will necessarily make binaries available for it even if it is merged.

<!-- gh-comment-id:4104911335 -->
Reference: github-starred/ollama#26257