Arm desktop: so many cores, not enough speed

24 points by raymii

rpaulo

This actually seems like a kernel scheduling problem or a device driver problem. There’s no reason why user interactive work can’t take priority over building packages. I wouldn’t blame the hardware.

k749gtnc9l3w

Well, there is not even a fundamental reason why a compiler seeing a huge block of smashed-together code cannot split it into functions (this can be done at IO speed) and then do most of the work in a parallelised way.

The writeup is probably mostly informative as an observation about what has been considered efficient use of effort in desktop and development software recently.

For scheduling, assumption mismatch about hardware behaviour seems likely…

koreth

But at the same time, 100% load on all cores means you cannot listen to music on Spotify or watch online videos, etc. All that because the CPU cores are occupied by the build processes.

I am sad that the nice command (or really, the concept of process priority in general) seems to have fallen out of our collective awareness.

k749gtnc9l3w

To be fair, nice helps with worst-case latency but doesn't really solve it on its own, even if the scheduler distributes the throughput well. I wonder whether a kernel build with realtime options would be worth it, given that there is throughput to spare (and also to do a kernel rebuild quickly!)
- accelbread
  
  I had this issue with foreground tasks like video playback stuttering when saturating the cores with background compilation tasks. I ended up using the BORE kernel patch (https://github.com/firelzrd/bore-scheduler) which solved the issue for me. I hear people are also using sched-ext for this?
  - k749gtnc9l3w
    
    Con Kolivas's scheduler was also about that, although I think it was explicitly designed for a low-ish number of cores and might have efficiency issues at 64+
- hawski
  
  Not that it would be helpful in CPU bound case, but there is also ionice.
  
  Are cgroups helpful in this case? I know that in the worst case one could at least limit the builds to not take everything. But I mean in prioritizing. Is it any better than nice/ionice?
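
As a rough illustration of the nice/ionice combination discussed in this subthread, here is a minimal sketch. `low_prio` is a hypothetical helper name, and the probe for a working `ionice` is an assumption about how to degrade gracefully on systems where the idle I/O class cannot be set.

```shell
# Sketch only: run a command at the lowest CPU priority and, where it works,
# in the idle I/O class. low_prio is a hypothetical helper name.
low_prio() {
    # Probe first: ionice may be missing or not permitted to set the class.
    if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
        # -c 3 selects the idle I/O scheduling class
        nice -n 19 ionice -c 3 "$@"
    else
        nice -n 19 "$@"
    fi
}

# Example (not run here): low_prio make -j "$(nproc)"
```

This only lowers priority; as noted above, it may not fix worst-case latency on its own. On systemd machines a cgroup weight (for example via `systemd-run --scope -p CPUWeight=...`) is probably the closest cgroup-based equivalent, but that is not shown or tested here.
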

ibisum

I'm sad folks don't have muscle memory for:
```
$ make -j $(($(nproc) - 1))
```
.. like the good old days.
- fanf
  
  It’s annoying that bare -j means unbounded parallelism rather than automatically limited by nproc.
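
The `nproc - 1` idiom above evaluates to zero on a single-core machine, which GNU make rejects as a job count. A small sketch of a guarded version follows; `build_jobs` is a hypothetical helper, and it accepts the core count as an optional argument purely so it can be tried without depending on the real machine.

```shell
# Sketch: leave one core free for interactive work, but never go below one job.
# build_jobs is a hypothetical helper name.
#   $1 = core count (optional; defaults to the output of nproc)
build_jobs() {
    cores=${1:-$(nproc)}
    n=$((cores - 1))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Example (not run here): make -j "$(build_jobs)"
```

GNU make also has `-l` (`--load-average`), which stops new jobs from starting while the load average is above a limit; that may be a gentler bound than a bare `-j`.
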

icefox

Something I haven't seen mentioned yet is that building 80 things at once will use significantly more memory than building 8 things at once 10x faster. (Different I/O patterns too, now that I think of it, though I doubt that's significant.) Linux's behavior at low memory is kinda shitty, which is my guess on the origin of lag. I've been bitten by this many times on high-core-count build servers.
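
One way to act on the memory point is to cap the job count by available memory as well as by cores. The sketch below assumes a per-job memory estimate, which is a guess that varies widely by project and compiler; `jobs_for_memory` is a hypothetical helper name.

```shell
# Sketch: cap parallel jobs by available memory as well as by core count.
# jobs_for_memory is a hypothetical helper; the per-job figure is an estimate.
#   $1 = core count
#   $2 = available memory in KiB
#   $3 = estimated memory per job in KiB
jobs_for_memory() {
    cores=$1
    avail_kib=$2
    per_job_kib=$3
    n=$((avail_kib / per_job_kib))
    if [ "$n" -gt "$cores" ]; then
        n=$cores
    fi
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Example (not run here), assuming roughly 2 GiB per compile job:
#   avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
#   make -j "$(jobs_for_memory "$(nproc)" "$avail" $((2 * 1024 * 1024)))"
```

With 80 cores and 32 GiB available, a 2 GiB estimate yields 16 jobs rather than 80, which trades some throughput for staying out of low-memory territory.
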

dijit

Interesting hypothesis.

I went the opposite way recently. Forgive how short this recap is; I’m writing from my phone.

It goes like this: I want to do game development over Parsec to a beefy machine.

My laptop is an ARM Thinkpad (Snapdragon X-2 Elite) - Windows works ok, Linux has basically no support for GPU, Webcam, Audio, Wifi.

Decide to use Windows. Parsec only has x86_64 binaries for Windows. No GPU acceleration for emulated process. Kinda works but loud laptop and no dual screen streaming.

Decide to get an HP Elitebook with a modern Intel CPU. Single display stream: butter smooth. Two? Major lag (and super inconsistent).

Go on a deep dive.

Discover that the way that the (very fast) media decoder works is via pipelining. Two videos invalidate the pipeline.

Test this by trying to play a video while streaming: the stream lags.

Inspect why 2 videos work; they use buffering, which streaming can’t do.

Test a dGPU (Arc A750) with an eGPU enclosure, two video decoders, works.

Decide to try my M2 Macbook Air. Works.

My guess: the H.265 decoder on the M-series doesn’t use pipelining (RISC, right!), so it doesn’t evict the pipeline all the time.
- j4m3s
  
  Hey! Slightly unrelated: I was wondering about those new ARM laptops. Are they any good? Do you still prefer the x86 ones? Also, do you know if your processor supports MTE? I've been trying to find info on it on the internet, but since it appears to be "optional", it's not easy to know whether they implemented it. (Apparently $ cat /proc/cpuinfo | grep mte suffices to find out.) :D Thanks!
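
The `/proc/cpuinfo` check mentioned above can be made a little stricter, since a plain `grep mte` would also match longer flag names such as `mte3`. This is a sketch under the assumption that the kernel lists flags on `Features` lines, as arm64 Linux does; `has_cpu_feature` is a hypothetical helper name.

```shell
# Sketch: report whether a CPU feature flag appears in a cpuinfo-style file.
# has_cpu_feature is a hypothetical helper name.
#   $1 = flag to look for, e.g. mte
#   $2 = file to read (optional; defaults to /proc/cpuinfo)
has_cpu_feature() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # -w matches whole words only, so asking for mte does not match mte3.
    grep -i '^Features' "$file" | grep -qw -- "$flag"
}

# Example (not run here): has_cpu_feature mte && echo "MTE reported"
```

This only reflects what the running kernel reports; a missing flag could mean the hardware lacks the feature or that it is not exposed on that system.
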

calvin

The Arm server market (which Altra is intended for, and it's also quite long in the tooth now) is pretty much exclusively targeting hyperscale (or hyperscale aspirational, i.e. the OVHs of the world), where you're maximizing tenants per system with massive core density. It doesn't matter that much if those cores are fast.
- heavyrain266
  
  For HPC clusters with high core density it’s excellent, as it has very low energy requirements. Each core in the Altra has a single thread instead of two like others. Unusual setup that requires kernel tweaking to unleash its full potential (I’m utilising those with FreeBSD at work). Homelabs could benefit from it as well: you get 32 cores with a TDP of 60 W, which is approachable in an off-grid network.