Arm desktop: so many cores, not enough speed
24 points by raymii
This actually seems like a kernel scheduling problem or a device driver problem. There’s no reason why user interactive work can’t take priority over building packages. I wouldn’t blame the hardware.
Well, there is not even a fundamental reason why a compiler that sees a huge block of smashed-together code cannot split it into functions (this can be done at I/O speed) and then do most of the work in parallel.
The writeup is probably mostly informative as an observation about what has been considered efficient use of effort in desktop and development software recently.
For scheduling, assumption mismatch about hardware behaviour seems likely…
But at the same time, 100% load on all cores means you cannot listen to music on Spotify or watch online videos, etc. All that because the CPU cores are occupied by the build processes.
I am sad that the nice command (or really, the concept of process priority in general) seems to have fallen out of our collective awareness.
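For anyone who wants the build itself to start at low priority, here is a minimal Python sketch. It assumes a POSIX system, and `run_niced` is a made-up helper name, not part of any build tool.

```python
import os
import subprocess


def run_niced(cmd, niceness=19, **kwargs):
    """Run cmd with its niceness raised by `niceness` (hypothetical helper)."""

    def lower_priority():
        # Runs in the child between fork and exec. os.nice() adds the
        # increment to the current niceness; the kernel clamps the result.
        os.nice(niceness)

    return subprocess.run(cmd, preexec_fn=lower_priority, **kwargs)
```

Something like `run_niced(["make", "-j80"])` would then start the whole build tree at the lowest CPU priority, since child processes inherit the niceness.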
To be fair, nice helps with worst-case latency but doesn't really solve it on its own, even if the scheduler distributes the throughput well. I wonder whether a kernel build with realtime options would be worth it, given that there is throughput to spare (and also to do a kernel rebuild quickly!)
I had this issue with foreground tasks like video playback stuttering when saturating the cores with background compilation tasks. I ended up using the BORE kernel patch (https://github.com/firelzrd/bore-scheduler) which solved the issue for me. I hear people are also using sched-ext for this?
Con Kolivas's scheduler was also about that, although I think it was explicitly designed for a low-ish number of cores and might have efficiency issues at 64+.
Not that it would be helpful in CPU bound case, but there is also ionice.
Are cgroups helpful in this case? I know that in the worst case one could at least limit the builds to not take everything. But I mean in prioritizing. Is it any better than nice/ionice?
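One way to try the cgroup route without writing to /sys/fs/cgroup by hand is a transient systemd scope with a reduced CPU weight. The sketch below only builds the command line. It assumes a systemd user session with the cpu controller delegated, `scoped_build_argv` is a hypothetical helper, and the weight of 20 is an arbitrary pick (systemd's default is 100).

```python
def scoped_build_argv(cmd, cpu_weight=20):
    """Wrap cmd in a transient systemd scope with a lower CPU weight.

    Hypothetical helper: it only builds the argv and runs nothing.
    """
    # systemd accepts CPUWeight values from 1 to 10000.
    if not 1 <= cpu_weight <= 10000:
        raise ValueError("cpu_weight must be between 1 and 10000")
    return [
        "systemd-run", "--user", "--scope",
        "-p", f"CPUWeight={cpu_weight}",
        "--", *cmd,
    ]
```

Passing the result to `subprocess.run` would launch the build in its own cgroup. The weight applies to the group as a whole rather than to each process, so it only bites when something else is competing for the CPU. Whether that beats nice in practice is something to measure, not assume.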
Something I haven't seen mentioned yet is that building 80 things at once will use significantly more memory than building 8 things at once 10x faster. (Different I/O patterns too, now that I think of it, though I doubt that's significant.) Linux's behavior at low memory is kinda shitty, which is my guess on the origin of lag. I've been bitten by this many times on high-core-count build servers.
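If memory is the real limit, the job count can be capped by available RAM as well as by core count. A rough Python sketch, assuming Linux's /proc/meminfo format; the helper names are made up, and the per-job memory figure is something you would have to measure for your own build.

```python
def mem_available_kib(meminfo_text):
    """Return the MemAvailable value (in kB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("no MemAvailable line found")


def pick_jobs(cpus, available_kib, per_job_kib):
    """Cap parallel jobs by cores and by memory, never going below one."""
    by_memory = available_kib // per_job_kib
    return max(1, min(cpus, by_memory))
```

For example, with 80 cores, 16 GiB available, and a guessed 2 GiB per compiler process, `pick_jobs(80, 16 * 1024 * 1024, 2 * 1024 * 1024)` gives 8 jobs instead of 80.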
Interesting hypothesis.
I went the opposite way recently. Forgive how short this recap is; I’m writing from my phone.
It goes like this: I want to do game development over Parsec to a beefy machine.
My laptop is an ARM Thinkpad (Snapdragon X-2 Elite) - Windows works ok, Linux has basically no support for GPU, Webcam, Audio, Wifi.
Decide to use Windows. Parsec only has x86_64 binaries for Windows. No GPU acceleration for emulated process. Kinda works but loud laptop and no dual screen streaming.
Decide to get an HP Elitebook with a modern Intel CPU. Single display stream: butter smooth. Two? Major lag (and super inconsistent).
Go on a deep dive.
Discover that the (very fast) media decoder works via pipelining. Two videos invalidate the pipeline.
Test this by trying to play a video while streaming; stream lags.
Inspect why 2 videos work; they use buffering, which streaming can’t do.
Test a dGPU (ARC A750) in an eGPU enclosure: two video decoders, works.
Decide to try my M2 Macbook Air. Works.
My guess: the h265 decoder on the M-series doesn’t use pipelining (RISC, right!), so it doesn’t evict the pipeline all the time.
Hey! Slightly unrelated. I was wondering about those new ARM laptops: are they any good? Do you still prefer the x86 ones?
Also, do you know if your processor supports MTE? I've been trying to find info on it on the internet, but since it appears to be "optional", it's not easy to know if they used it. ($ cat /proc/cpuinfo | grep mte apparently suffices to know ~) :D
Thanks!
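A plain grep for mte also matches any longer token that contains those letters, so splitting the feature line into whole words is a bit stricter. A small Python sketch, assuming the arm64 Linux layout where /proc/cpuinfo has a "Features" line per core; `cpu_features` is a made-up helper. Note that it reports what the kernel exposes, which may not be the whole story about the silicon.

```python
def cpu_features(cpuinfo_text):
    """Collect feature tokens from /proc/cpuinfo contents (hypothetical helper)."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # arm64 uses "Features"; x86 uses "flags". Accept either.
        if sep and key.strip().lower() in ("features", "flags"):
            features.update(value.split())
    return features
```

Usage would be along the lines of `"mte" in cpu_features(open("/proc/cpuinfo").read())`.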
The Arm server market (which Altra is intended for, and it's also quite long in the tooth now) is pretty much exclusively targeting hyperscale (or hyperscale aspirational, i.e. the OVHs of the world), where you're maximizing tenants per system with massive core density. It doesn't matter that much if those cores are fast.
For HPC clusters with high core density it’s excellent, as it has very low energy requirements. Each core in the Altra has a single thread instead of two like others. It's an unusual setup that requires kernel tweaking to unleash its full potential (I’m using those with FreeBSD at work). Homelabs could benefit from it as well: you get 32 cores with a TDP of 60W, which is approachable in an off-grid network.