[GH-ISSUE #1730] MLX backend #47497

Open
opened 2026-04-28 03:56:24 -05:00 by GiteaMirror · 95 comments

Originally created by @ageorgios on GitHub (Dec 27, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1730

Can ollama be converted to use MLX from Apple as a backend for the models?

GiteaMirror added the feature request label 2026-04-28 03:56:25 -05:00

@Josecodesalot commented on GitHub (Dec 31, 2023):

This Please!


@easp commented on GitHub (Jan 2, 2024):

What do you hope to gain from this? I don't think MLX is faster for inference, at least not yet.


@KernelBypass commented on GitHub (Jan 10, 2024):

Found these benchmarks:
https://medium.com/@andreask_75652/benchmarking-apples-mlx-vs-llama-cpp-bbbebdc18416

Seems like MLX is indeed slower than the llama.cpp masterpiece, at least for now. I did not verify though.


@Edu126 commented on GitHub (Jan 23, 2024):

This would be very nice!
And not only for text generation; image/multimodal would be boosted too.


@JimmyLv commented on GitHub (Apr 20, 2024):

someone made this https://github.com/kspviswa/PyOMlx


@magnusviri commented on GitHub (May 4, 2024):

Ollama is awesome and does so many things and some of us want to play with mlx models.


@angelo-cortez commented on GitHub (May 30, 2024):

bump


@mxyng commented on GitHub (May 30, 2024):

Commenting here to say we're aware of MLX. I've been working on a prototype, but I can't give an ETA for MLX support at this time.


@qdrddr commented on GitHub (Jun 28, 2024):

Related to this: Apple CoreML support to utilize the Apple Neural Engine (ANE) alongside GPU & CPU:
https://github.com/ollama/ollama/issues/3898


@ibehnam commented on GitHub (Aug 20, 2024):

Any updates on this? MLX is now faster than Llama.cpp on Mac.


@garhbod commented on GitHub (Aug 21, 2024):

Any progress @mxyng? Is this a separate project that others could contribute to?


@nicarq commented on GitHub (Aug 29, 2024):

It would be awesome. MLX is moving really fast, and it may well become the best long-term tool for running models on Apple's hardware.


@parthpat12 commented on GitHub (Sep 9, 2024):

Please add support for MLX! Any update @mxyng?


@ivanfioravanti commented on GitHub (Sep 15, 2024):

MLX support would be awesome!!!


@czzarr commented on GitHub (Sep 15, 2024):

indeed it would be!


@vietvudanh commented on GitHub (Oct 2, 2024):

MLX would support vision models too.


@Bigsy commented on GitHub (Oct 9, 2024):

Seems the new MLX backend in LM Studio is providing some real benefits, especially with regard to memory consumption. Would be great to get support in Ollama.


@hg0428 commented on GitHub (Oct 10, 2024):

I have been testing the MLX backend in LM Studio, and I have found it to be on average 40% faster for inference than Ollama using the same exact settings with the same model at the same precision.
I am using M3 Max 36GB memory.


@robbiemu commented on GitHub (Oct 18, 2024):

> I have been testing the MLX backend in LM Studio, and I have found it to be on average 40% faster for inference than Ollama using the same exact settings with the same model at the same precision. I am using M3 Max 36GB memory.

I've seen numbers, admittedly a couple of months ago, around 20% faster. Can you share a bit more: what models/context/settings? What iogpu.wired_limit_mb? etc.

20% is still 20% more than I'm doing currently :D

I'm not sure how good of an idea it is to have Ollama add a lot of features only available to some people... but it does have some NVIDIA-exclusive (or NVIDIA vs CPU-only) stuff at least.


@CharafChnioune commented on GitHub (Oct 20, 2024):

So, any plans to add MLX like LM Studio? MLX supports multimodal and is faster now. Llama.cpp is sort of dead since they stopped vision support.


@twalderman commented on GitHub (Oct 23, 2024):

Given that there are fewer model options with MLX, but the ones that are available are good workhorses, it would be a huge benefit to have Ollama support MLX. As others have stated, LM Studio supports MLX and the performance is great; however, Ollama still supports a wider range of templates and potentially upcoming support for more sampler options. Having one solution is ideal for Apple Silicon.


@ice6 commented on GitHub (Oct 30, 2024):

It would be good to support this :) Most importantly, keep ollama steady and fast!


@nercone-dev commented on GitHub (Oct 31, 2024):

On the MacBook Air (base Apple M3), it is now faster.
For example, it was faster even when running 13B Codellama.
This is probably a technology that can be optimized for Apple silicon (especially M3 and later), so it would be better to implement it.

https://github.com/user-attachments/assets/6aadc898-1c62-43c8-91d6-8c2308db603c

https://github.com/user-attachments/assets/e65e373f-a2bd-4089-82d0-66c3cbab4db5

Model: MacBook Air 13-inch (2024, M3)
CPU: Apple M3
Memory (RAM): 16GB


@hg0428 commented on GitHub (Oct 31, 2024):

On my hardware, MLX runs an average of 40% faster than Llama.cpp (actual percentage varies from 38%-42%. 40% is the average over many tests).


@twalderman commented on GitHub (Oct 31, 2024):

What are people using for mlx local serving? What implementation is best and worth implementing in ollama?


@hg0428 commented on GitHub (Oct 31, 2024):

> What are people using for mlx local serving? What implementation is best and worth implementing in ollama?

The developers of LM Studio have created a wrapper around MLX that makes it super simple. They used it to transition LM Studio from supporting only Llama.cpp as a backend to being able to support both Llama.cpp and MLX.

https://github.com/lmstudio-ai/mlx-engine


@ahmetkca commented on GitHub (Nov 9, 2024):

> What do you hope to gain from this? I don't think MLX is faster for inference, at least not yet.

I have just tried LM Studio's new MLX backend and you can see an 11+ tokens-per-second improvement for the same model. The model in question was qwen2.5:7b-instruct-q8_0, going from ~70 tokens per second to ~81 tokens per second.


@ahmetkca commented on GitHub (Nov 9, 2024):

> > What are people using for mlx local serving? What implementation is best and worth implementing in ollama?
>
> The developers of LM Studio have created a wrapper around MLX that makes it super simple. They used it to transition LM Studio from supporting only Llama.cpp as a backend to being able to support both Llama.cpp and MLX.
>
> https://github.com/lmstudio-ai/mlx-engine

https://github.com/ollama/ollama/issues/1730#issuecomment-2466166189


@ahmetkca commented on GitHub (Nov 9, 2024):

> > What are people using for mlx local serving? What implementation is best and worth implementing in ollama?
>
> The developers of LM Studio have created a wrapper around MLX that makes it super simple. They used it to transition LM Studio from supporting only Llama.cpp as a backend to being able to support both Llama.cpp and MLX.
>
> https://github.com/lmstudio-ai/mlx-engine

This is actually a wrapper around MLX's mlx-lm Python package. So perhaps the Ollama team can do better by completely bypassing Python?
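For context, a minimal sketch of what driving mlx-lm directly looks like, assuming its `load`/`generate` helpers; the model repo name below is only an example:

```python
# Minimal mlx-lm usage sketch (assumes `pip install mlx-lm` on Apple Silicon).
# The model below is an example 4-bit conversion from the mlx-community org.
from mlx_lm import load, generate

# Download (or reuse a cached copy of) an MLX-converted model and its tokenizer.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

# Run a single completion; MLX allocates unified memory lazily as it generates.
text = generate(
    model,
    tokenizer,
    prompt="Why might MLX be fast on Apple Silicon?",
    max_tokens=256,
)
print(text)
```

This is roughly the layer that mlx-engine wraps, and that an Ollama MLX backend would have to either call into or reproduce natively.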


@logiota commented on GitHub (Nov 10, 2024):

> What do you hope to gain from this? I don't think MLX is faster for inference, at least not yet.

llama3.2 is at least 64% faster with MLX on M4! Just got mine :)


@Idmon commented on GitHub (Nov 18, 2024):

Are we gonna see MLX support?

I think now with the new M4 Max chips it's a great time to support it.


@ljgeneral commented on GitHub (Nov 19, 2024):

Following, looking forward to Ollama support. I will try LM Studio first.


@bhupesh-sf commented on GitHub (Nov 22, 2024):

Following to see this supported natively.


@NashvilleBrandon commented on GitHub (Nov 27, 2024):

The people need this.


@bakaburg1 commented on GitHub (Dec 4, 2024):

Up!


@baoduy commented on GitHub (Dec 5, 2024):

1 more vote for this month be 👍👍👍


@elfriscia commented on GitHub (Dec 5, 2024):

> 1 more vote for this month be 👍👍👍

Wish granted. One more 👍
This feature is native in LM Studio and there's a big difference.


@vietvudanh commented on GitHub (Dec 6, 2024):

Well, I ended up using MLX directly. Just wish it supported JSON output mode.


@smsmatt commented on GitHub (Dec 7, 2024):

Would appear to be a big win for Ollama to put this feature in.


@cryptedx commented on GitHub (Dec 8, 2024):

Well, this would be a big win for Ollama!


@hg0428 commented on GitHub (Dec 8, 2024):

Llama.cpp is exploring Apple Silicon ANE support, which not even MLX has. If implemented properly, that could make Llama.cpp significantly faster than MLX.


@ivanfioravanti commented on GitHub (Dec 8, 2024):

All models must be created ad hoc for ANE. Moreover, ANE is faster only if you have few GPU cores; otherwise it is not. (Sent from the road.)


@hg0428 commented on GitHub (Dec 8, 2024):

> All models must be created adhoc for ANE. Moreover ANE is faster if you have few GPU cores otherwise no.

The ANE is like the GPU, but specialized for AI. The model files themselves need not be changed. You can run anything on it just like the GPU, thanks to the new API that Apple released. Previously, you had to use CoreML; now, you can access it directly. The ANE alone can get performance roughly equivalent to the M4 Max GPU, which is quite good. Combined with the GPU, it can result in a significant performance boost.


@elfriscia commented on GitHub (Dec 8, 2024):

The troll downvoting is frustrated for not being able to have a MacBook and will reply anything to contradict this statement.


@vietvudanh commented on GitHub (Dec 9, 2024):

Well, soon I guess: Ollama's post (https://x.com/ollama/status/1865238754052485293)


@iamhenry commented on GitHub (Dec 9, 2024):

(image: https://github.com/user-attachments/assets/2d0b000d-a651-4ca3-8653-33546c2f66d3)


@integrate-your-mind commented on GitHub (Dec 11, 2024):

If this happens by Monday I will be so happy.


@rheinardkorf commented on GitHub (Dec 17, 2024):

It did not happen Monday. 😅


@loganyc1 commented on GitHub (Dec 18, 2024):

Did this ship?


@MS00-GitIt commented on GitHub (Dec 18, 2024):

Reading all this has been fun. I can't wait to see this issue marked as "Closed".


@zhaopengme commented on GitHub (Dec 28, 2024):

Waiting.(๑˃̵ᴗ˂̵)


@dalisoft commented on GitHub (Dec 30, 2024):

> Reading all this has been fun. I can't wait to see this issue marked as "Closed".

Better if it's "Closed as completed", not "Closed as not planned" 😄


@Swich1987 commented on GitHub (Jan 4, 2025):

Looking forward for that support 🤞


@jeffhaskin commented on GitHub (Jan 16, 2025):

> What do you hope to gain from this? I don't think MLX is faster for inference, at least not yet.

Update 1 year later:

  • MLX is 25.02% faster than GGUF on my M1 Air 16G
  • MLX is 41.01% faster than GGUF on my M2 Pro 32G.

Model: Llama 3.1 8b Instruct Q4
Platform: LM Studio

On the M1 Air, even Phi4-14b is getting usable speeds (7.06 tok/sec MLX vs 5 tok/sec GGUF), which puts it just into the usability range for me.

On the M2 Pro 32G, I'm getting similar speeds with Qwen2.5 32B (mlx), which is great.


@iamhenry commented on GitHub (Jan 16, 2025):

monday is coming up 😅


@schneipk commented on GitHub (Jan 22, 2025):

I'm so stoked ! OllaMLXa <3


@Unicorndy commented on GitHub (Jan 24, 2025):

Is it happening soon? 🤞


@Byte1122 commented on GitHub (Jan 31, 2025):

I am using LM Studio and Ollama. The MLX models on LM Studio are way faster. I hope I can switch entirely to Ollama, but for now I'm sticking with LM Studio. Anyway, I'd love to support this. If Ollama needs donations or whatsoever, I'd love to donate!


@bhupesh-sf commented on GitHub (Feb 1, 2025):

here we go https://github.com/ollama/ollama/pull/8490


@TheurgicDuke771 commented on GitHub (Feb 14, 2025):

I see the PR https://github.com/ollama/ollama/pull/8490 is now closed. Can we expect this in next stable release?


@kconner commented on GitHub (Feb 18, 2025):

#8490 was superseded by #9118, which seems worth watching for progress.


@neolee commented on GitHub (Feb 22, 2025):

So do we have any plan and/or schedule for MLX support?


@HadiCherkaoui commented on GitHub (Feb 23, 2025):

Has it arrived?


@robwilkes commented on GitHub (Feb 26, 2025):

Models converted to MLX format are ~20% faster than the same models in GGUF format on my M4 MacBook Pro
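For reference, conversion like this was usually done with the mlx-lm tooling; a minimal sketch, assuming the `mlx_lm.convert` helper, with the source repo and output path below being examples only:

```python
# Sketch: convert a Hugging Face checkpoint to MLX format with quantization.
# Assumes `pip install mlx-lm`; repo name and output directory are examples.
from mlx_lm import convert

convert(
    "meta-llama/Meta-Llama-3-8B-Instruct",         # source weights on Hugging Face
    mlx_path="Meta-Llama-3-8B-Instruct-mlx-4bit",  # local output directory
    quantize=True,                                 # write 4-bit quantized weights
)
```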


@Caojunisstudying commented on GitHub (Mar 14, 2025):

> I am using LM Studio and Ollama. The MLX models on LM Studio are way faster. I hope I can switch entirely to Ollama, but for now I'm sticking with LM Studio. Anyway, I'd love to support this. If Ollama needs donations or whatsoever, I'd love to donate!

However, when I use the MLX model, my Mac mini takes up much more memory, so I have to abandon MLX; if CPU performance is good enough, I would rather spend the memory on models with higher precision. I don't know how Ollama will run MLX models in the future; I hope for better results.


@robwilkes commented on GitHub (Mar 14, 2025):

> > I am using LM Studio and Ollama. The MLX models on LM Studio are way faster. I hope I can switch entirely to Ollama, but for now I'm sticking with LM Studio. Anyway, I'd love to support this. If Ollama needs donations or whatsoever, I'd love to donate!
>
> However, when I use the MLX model, my Mac mini takes up much more memory, so I have to abandon MLX; if CPU performance is good enough, I would rather spend the memory on models with higher precision. I don't know how Ollama will run MLX models in the future; I hope for better results.

Are you sure you're comparing the same quantization, context size, etc, like for like?

The memory usage is the same for me, maybe even a tiny bit better as Mac can manage the memory slightly better with MLX, however it's mostly negligible / identical.

The filesizes are roughly equivalent and therefore the memory utilisation is roughly equivalent.


@Caojunisstudying commented on GitHub (Mar 14, 2025):

> > > I am using LM Studio and Ollama. The MLX models on LM Studio are way faster. I hope I can switch entirely to Ollama, but for now I'm sticking with LM Studio. Anyway, I'd love to support this. If Ollama needs donations or whatsoever, I'd love to donate!
> > >
> > > However, when I use the MLX model, my Mac mini takes up much more memory, so I have to abandon MLX; if CPU performance is good enough, I would rather spend the memory on models with higher precision. I don't know how Ollama will run MLX models in the future; I hope for better results.
>
> Are you sure you're comparing the same quantization, context size, etc, like for like?
>
> The memory usage is the same for me, maybe even a tiny bit better as Mac can manage the memory slightly better with MLX, however it's mostly negligible / identical.
>
> The filesizes are roughly equivalent and therefore the memory utilisation is roughly equivalent.

For example, qwq-32b Q4 on Ollama vs qwq-32b 4-bit on LM Studio: my Mac mini has just 24G of memory, which is under very high load in that scenario. The result is that Ollama struggles but finishes the answer, while LM Studio makes my Mac crash so I have to restart. I can use Ollama with Docker and RAGFlow normally, but LM Studio crashes from running out of RAM even on 14B models.
So I think Ollama is running on the CPU with the 24G of RAM, but when the MLX model uses the GPU to accelerate, part of the RAM must be allocated to the GPU, so there is not enough RAM left.


@arty-hlr commented on GitHub (Mar 14, 2025):

MLX does not need to preallocate memory for the context size, so you should be able to run models with their full context size with better memory utilisation. @Caojunisstudying the whole 24G are not available to you/to the GPU, only 2/3 of those by default. Did you change the limit with `sudo sysctl iogpu.wired_limit_mb=XXXXX`?
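As an illustration, a small sketch of how such a limit could be computed and applied from a script; the 80% fraction and the helper name are assumptions for the example, and the `iogpu.wired_limit_mb` value itself resets on reboot:

```python
# Sketch: raise the macOS GPU wired-memory limit so Metal/MLX can use more of
# the unified memory. Needs sudo, and the value resets on reboot.
import subprocess

def set_gpu_wired_limit(fraction: float = 0.8) -> None:
    # Total physical memory in bytes, from the hw.memsize sysctl.
    total_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
    limit_mb = int(total_bytes / (1024 * 1024) * fraction)  # e.g. 80% of RAM
    subprocess.run(["sudo", "sysctl", f"iogpu.wired_limit_mb={limit_mb}"], check=True)
    print(f"iogpu.wired_limit_mb set to {limit_mb} MB")

if __name__ == "__main__":
    set_gpu_wired_limit()
```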


@Caojunisstudying commented on GitHub (Mar 14, 2025):

> MLX does not need to preallocate memory for the context size, so you should be able to run models with their full context size with better memory utilisation. @Caojunisstudying the whole 24G are not available to you/to the GPU, only 2/3 of those by default. Did you change the limit with `sudo sysctl iogpu.wired_limit_mb=XXXXX`?

That may be the key. My LM Studio keeps using the default settings; I will change it and try again. Thanks a lot for your suggestion.


@anurmatov commented on GitHub (Mar 16, 2025):

This GPU memory limit parameter is a game changer, although it's worth keeping in mind that it's a runtime setting and resets on restart. I've created a script (https://github.com/anurmatov/mac-studio-server) that automatically sets the desired value even after restarts; it might be helpful, especially for headless setups.


@easp commented on GitHub (Mar 16, 2025):

man sysctl.conf


@zengqingfu1442 commented on GitHub (Mar 23, 2025):

Add mlx-vlm backend also.


@matthieuHenocque commented on GitHub (May 9, 2025):

Lack of MLX support is the only reason I don't use Ollama. In some cases MLX Q8 is 20% faster than GGUF, and memory usage is better handled.

So please, add MLX support to Ollama


@vdbwim commented on GitHub (Jun 3, 2025):

Is there a target date to get MLX supported?


@sxy-trans-n commented on GitHub (Jun 4, 2025):

https://github.com/Trans-N-ai/swama

High-performance MLX-based LLM inference engine for macOS with native Swift implementation


@FelikZ commented on GitHub (Jun 23, 2025):

> https://github.com/Trans-N-ai/swama

It's next-level fast indeed. Tried it yesterday with Qwen3 30B; mind-blowing. Would be nice if it got support for Mistral models too.

The benchmarks are crazy (https://github.com/Trans-N-ai/swama/issues/15#issuecomment-2961338474):

(image: https://github.com/user-attachments/assets/e924d4d2-e95b-4fdf-ae23-5f3e36bef748)


@openSourcerer9000 commented on GitHub (Jul 19, 2025):

Seems Ollama's Mac userbase is withering away along with this PR:
Draft MLX go backend for new engine by dhiltgen · Pull Request #9118 · ollama/ollama https://share.google/snsqjHmkYOqBecvLf


@zhaopengme commented on GitHub (Aug 4, 2025):

Hi. Have there been any recent developments?


@HKiOnline commented on GitHub (Aug 5, 2025):

I'm using LM Studio at the moment as it has a MLX backend. The performance differences are very clear. I would love to use Ollama instead.


@BradKML commented on GitHub (Aug 15, 2025):

@HKiOnline got anything similar that is FOSS AND MLX-compatible?


@CharafChnioune commented on GitHub (Aug 15, 2025):

GTA 6 or mlx support? Bets are on


@mkozjak commented on GitHub (Aug 15, 2025):

> @HKiOnline got anything similar that is FOSS AND MLX-compatible?

mlx-omni-server, as some say, might be the best option for us.


@curious-boy-007 commented on GitHub (Sep 14, 2025):

@ageorgios @mkozjak @Josecodesalot
You might want to take a look at this MLX + GGUF compatible CLI tool:

- https://github.com/NexaAI/nexa-sdk

@mkozjak commented on GitHub (Sep 14, 2025):

> @ageorgios @mkozjak @Josecodesalot You might want to take a look at this MLX + GGUF compatible CLI tool:
>
> - https://github.com/NexaAI/nexa-sdk

Doesn't work at all in combination with Zed for me.


@curious-boy-007 commented on GitHub (Sep 14, 2025):

@mkozjak Please try this version of the macOS installer:
https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#macos
Also, would you please let me know the error log for the Zed editor? In the macOS built-in terminal, it should work.


@ericcurtin commented on GitHub (Oct 13, 2025):

In Docker Model Runner we've put effort into putting all our code in one central place to make it easier for people to contribute. Please star, fork and contribute (especially please contribute an mlx backend):

https://github.com/docker/model-runner

We have vulkan support. You can pull models from Docker Hub, Huggingface or any other OCI registry. You can also push models to Docker Hub or any other OCI registry.


@Globolo001 commented on GitHub (Dec 20, 2025):

Any updates on this feature?
Very surprised it's taking THIS long, as LM Studio has had it for a while now (but I guess that's just the curse of open source).


@jeffhaskin commented on GitHub (Jan 17, 2026):

Haven't touched Ollama in more than a year because it doesn't support MLX. It used to be the core of my system.

MLX is significantly faster on my M2 MacBook than GGUF.

Our love was sweet, but it is ended. Farewell, sweet Juliette.


@TomLucidor commented on GitHub (Jan 19, 2026):

My current choice; they seem to be moving very fast: https://github.com/cubist38/mlx-openai-server
(As for something more conventional, Ramalama or Jan seem good as a FOSS universal adapter compared to LM Studio?)


@Globolo001 commented on GitHub (Jan 19, 2026):

Awwwww maaaannnn.
I feel like in theory Claude Opus should just be able to look at the MLX docs, check the open-source LMStudio MLX adapter (https://github.com/lmstudio-ai/mlx-engine) for reference, and implement it in an afternoon.

Is it just because of a lack of trying, or a promising work in progress that is proving to be way harder than I naively think?
Or has an architectural roadblock been hit that effectively does not allow Ollama to use anything other than llama.cpp?

I could find two active branches, but no info on what they actually try to achieve, apart from the MLX in the branch name. (Also, why don't branches have a feature description attribute in git? So annoying just trying to figure out the intention from branch and commit names.)
https://github.com/ollama/ollama/tree/mlx-gpu-cd

https://github.com/ollama/ollama/tree/mxyng/next-mlx

Also, the main readme seems to allow for a custom build with MLX:
"Building with MLX (experimental)"

However, I'm not sure whether this is also for pulling MLX models, or only for building models using MLX.
I only did very shallow digging. If there is a better discussion I'd love to get some insights :)


@Jingyuan-Zheng commented on GitHub (Mar 4, 2026):

Is it currently supported to run MLX models in Ollama? I haven't found a configuration method or an option to enable it.


@qdrddr commented on GitHub (Mar 4, 2026):

I think it's supported now, but I can't find documentation on how to use it.
https://github.com/ollama/ollama/pull/13648


@jackbravo commented on GitHub (Mar 4, 2026):

And it seems it is just for some use cases:

> Yes, our goal is to add text and embedding models. We're starting with imagegen since that's a brand new capability for Ollama, and once we get it working well, then we'll start tackling other models. We're expecting a lot of churn in the backend code as we flesh things out and refine how we want the engine to work.


@Byte1122 commented on GitHub (Mar 4, 2026):

Yes, it is supported, guys; just use AI to discover it. This was added not long ago. Also, you need to enable it in the config or Modelfile.

https://github.com/ollama/ollama/commit/33ee7168ba1e16c813b52dc2c9417efa1e2e9f20


@huyz commented on GitHub (Mar 5, 2026):

Is MLX support still experimental and does it require special compilation?
I'm looking for out-of-the-box support

Reference: github-starred/ollama#47497