MAX models can now run on Apple silicon GPUs

5 points by melodyogonna

symgryph

Dumb question: what exactly is a MAX model? Is it yet another llama.cpp thing? In other words, does it just run models? Why is this special? There are a lot of things that run on Apple silicon. I wish the website would tell you what it is about!

sloane

Not dumb, I had the exact same question while clicking around in confusion before closing the tab.

melodyogonna

What exactly is a Max Model?

These are your normal open source models, served through the MAX stack: https://docs.modular.com/max/models/

Is it another llama.cpp thing?

In short, yes. It's a bit of llama.cpp and a bit of vLLM: it is small enough to spin up and run on your own machine, but it can also scale up to datacenter-scale AI serving.

Why is this special?

Two things:
1. It is another milestone for the new Mojo programming language.
2. It validates that the Modular stack can portably target heterogeneous hardware. This was Modular's stated goal, but until now the stack only really worked with accelerators of "similar architecture", e.g. AMD and Nvidia GPUs with separate host and device memory, so the memory access abstractions were built around that access pattern. Apple GPUs, with their shared memory architecture, are different from these two. It has been nice watching support for them mature to the point where it can run full models.
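
To make the memory point concrete, here is a toy Python sketch of the two access patterns. It is not Mojo and not MAX's actual API: `DiscreteMemory`, `UnifiedMemory`, `to_device`, and `run_kernel` are all made-up names for illustration. The idea is only that with separate host and device memory the "kernel" works on a copy, while with shared memory it works on the same buffer the host sees.

```python
from dataclasses import dataclass


@dataclass
class DiscreteMemory:
    """Toy model of a GPU with its own memory: host data must be copied over."""
    copies: int = 0

    def to_device(self, host_data: bytearray) -> bytearray:
        self.copies += 1
        return bytearray(host_data)  # separate device-side allocation


@dataclass
class UnifiedMemory:
    """Toy model of shared CPU/GPU memory: one allocation visible to both."""
    copies: int = 0

    def to_device(self, host_data: bytearray) -> bytearray:
        return host_data  # no copy, same buffer


def run_kernel(memory, host_data: bytearray) -> bytearray:
    """Hypothetical 'kernel' that increments every byte of the device buffer."""
    device_data = memory.to_device(host_data)
    for i in range(len(device_data)):
        device_data[i] = (device_data[i] + 1) % 256
    return device_data
```

With `DiscreteMemory` the host buffer is left untouched and the result has to be copied back; with `UnifiedMemory` the host sees the result immediately. An abstraction designed only around the first case bakes in transfers that the second case does not need.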

Something to note is that this stack is entirely self-contained. The same code base and kernels target all of this hardware: no CUDA kernels, MLX kernels, or ROCm kernels, all Mojo. https://github.com/modular/modular/tree/main/max/kernels

JulianWgs

MAX is the inference engine from the AI company Modular, which is better known for the Mojo programming language.

https://www.modular.com/open-source/max

https://mojolang.org/