How and Why Local LLMs Perform On Framework 13 AMD Strix Point
13 points by msf
In case anyone is interested, I've been tracking my progress with NixOS + Strix Halo here:
https://discourse.nixos.org/t/how-to-ollama-on-amd-strix-halo/74363
I'm no expert, and if you have better benchmarks please share!
I normally wouldn't post on these, but the statistics are wrong on every level. I regularly get in excess of 100 GB per second of memory bandwidth, and I get 40 to 50 tokens per second with an MoE model that has 80 billion parameters. Also note that this person only has 64 GB! Having more memory makes a huge difference. If you quantize properly, local LLMs run at much higher rates and are almost as effective as the remote ones, at least for coding purposes. I will preface this by saying that when coding this way I do use a larger LLM, say Gemini Pro, to do some of the planning.

The only complaint I would have is that many of the newer models do not work particularly well with AMD, and you will be very disappointed if you try to do anything in Python! But if you just want to do basic research, it's not a bad buy. If you want to do ML for a living, don't waste your time with this: AMD is still far, far behind Nvidia in terms of software, and I suspect it will be another several years before they reach the same level of software support.
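To make the quantization and MoE point concrete, here is a rough back-of-the-envelope sketch. It assumes decode is purely memory-bandwidth-bound, takes ~256 GB/s as the Strix Halo's theoretical bandwidth, and assumes ~3B active parameters for the 80B MoE (a Qwen3-Next-80B-A3B-style layout); all of these numbers are illustrative, not measurements from this thread.

    # Back-of-the-envelope decode speed on a bandwidth-bound system.
    # Assumption: every generated token must stream the active weights
    # from memory once, so tps <= bandwidth / bytes_read_per_token.

    def ceiling_tps(bandwidth_gb_s, active_params_b, bytes_per_param):
        """Theoretical upper bound on tokens per second."""
        bytes_per_token = active_params_b * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / bytes_per_token

    bw = 256.0  # GB/s, assumed theoretical LPDDR5X bandwidth

    # Dense 80B model at Q4 (~0.5 bytes/param): all 80B params read per token.
    print(ceiling_tps(bw, 80, 0.5))   # ~6.4 tps ceiling

    # MoE with ~3B active params at Q4: only the active experts are read.
    print(ceiling_tps(bw, 3, 0.5))    # ~170 tps ceiling

Real-world throughput lands well below these ceilings (kernel overhead, KV-cache reads, sub-theoretical bandwidth), but the gap between the dense and MoE ceilings is why 40 to 50 tps is plausible for an 80B MoE and out of reach for an 80B dense model.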
and I get 40 to 50 tokens per second with an MoE model that has 80 billion parameters.
do note that if this is one of Qwen's gated DeltaNet models (i.e. the Next and 3.5 lines), I'd expect the tps to also be influenced by the architecture?
On what hardware do you get those numbers? And yes, with MoE the set of parameters that needs to be read per token is smaller than the model size; the math presented in the post is based on dense models and reading all the params.
Can you run some of the commands I have in the post to get a sense of the memory bandwidth of the hardware where you get those?
Is it also an amd64 laptop?
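The exact commands live in the linked post and aren't reproduced here; as a stand-in, a minimal sketch of the kind of measurement being asked for is a timed large-array copy, which gives a rough effective-bandwidth figure (not the post's exact methodology):

    # Rough effective memory bandwidth: time a large array copy.
    import time
    import numpy as np

    src = np.ones(1 << 28, dtype=np.float32)  # ~1 GiB of data
    dst = np.empty_like(src)

    best = float("inf")
    for _ in range(5):
        t0 = time.perf_counter()
        np.copyto(dst, src)          # streams src in and dst out
        best = min(best, time.perf_counter() - t0)

    # A copy moves 2x the buffer size (read source + write destination).
    print(f"~{2 * src.nbytes / best / 1e9:.0f} GB/s effective")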
I did send an email, but the short answer is I have a Framework Desktop with 128 GB of RAM. Please also keep in mind that mine pushes 140 W. In my opinion, never use dense models on AMD. Also, a tool called amdgpu_top, which is written in Rust, will give you much more accurate instantaneous measures of bandwidth while you're actually doing inference.
right, so you have the Strix Halo, which has a 256-bit LPDDR5X-8000 memory bus (roughly 256 GB/s of theoretical bandwidth), not the Strix Point in the Framework 13.
But point taken, and I've learned about the importance of sparse MoE models on memory-bandwidth-limited systems.