10% of Firefox crashes are caused by bitflips
73 points by glacambre
I asked a question in that thread because I'm either missing something about the claim or the claim is too strong, but didn't get a response. The author says the estimate is conservative and likely lower than reality, but the process summarised in the bug tracker is: if the bad pointer access is for an unmapped page and there's a mapped page one bit-flip away, the access is classified as a hardware memory issue.
But that seems to ignore a whole class of issues where a valid pointer gets corrupted by either bad_ptr->flag=value or *bad_ptr += 2**x which also result in a bit flip.
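To make the objection concrete, here is a minimal Python sketch of the classification rule as summarised above. It is not Mozilla's implementation: `looks_like_bitflip`, `mapped_pages` (a set of page numbers), and the 4 KiB page size are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def one_bit_neighbors(addr: int, bits: int = 64):
    """Yield every address that differs from addr in exactly one bit."""
    for i in range(bits):
        yield addr ^ (1 << i)


def looks_like_bitflip(fault_addr: int, mapped_pages: set, bits: int = 64) -> bool:
    """Hypothetical version of the rule: the faulting address is unmapped,
    but some address one bit-flip away falls in a mapped page."""
    if fault_addr // PAGE_SIZE in mapped_pages:
        return False  # the address is mapped, so the rule does not apply
    return any(n // PAGE_SIZE in mapped_pages
               for n in one_bit_neighbors(fault_addr, bits))


# One mapped page, and a valid pointer into it.
mapped = {0x7F0000001000 // PAGE_SIZE}
valid_ptr = 0x7F0000001000

# A software bug that adds 2**20 to the pointer (bit 20 was clear) ...
software_corrupted = valid_ptr + 2**20
# ... produces exactly the same value as a hardware flip of bit 20.
hardware_flipped = valid_ptr ^ (1 << 20)

print(software_corrupted == hardware_flipped)
print(looks_like_bitflip(software_corrupted, mapped))
```

In this sketch the software-corrupted pointer and the hardware-flipped pointer are the same value, so a rule that only looks at the faulting address cannot tell them apart.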
They wrote:
I've looked at the numbers we got from the users who run the memory tester after having experienced a crash: for every two crashes we think are caused by a bit-flip the memory tester found one genuine hardware issue.
So 25 000 / 470 000 were potential bit flip crashes, and the memory tester confirms a hardware problem in about half the cases when it runs.
Which I think says roughly 2 to 3% of crashes are hardware problems, possibly more. (I can’t tell what proportion are due to running out of space.)
Dunno where the 10% headline figure comes from.
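For anyone who wants to check the arithmetic, here is a quick Python sketch using only the figures quoted in this comment. The variable names are made up, and treating "one genuine hardware issue for every two suspected crashes" as a 50% confirmation rate is this comment's reading, not something the author stated.

```python
# Figures quoted above; the interpretation is an assumption.
potential_bitflip_crashes = 25_000
total_crashes = 470_000
confirm_rate = 0.5  # "for every two crashes ... one genuine hardware issue"

# Share of all crashes flagged as potential bit flips.
potential_share = potential_bitflip_crashes / total_crashes

# Share of all crashes where the memory tester would back up the diagnosis.
confirmed_share = potential_share * confirm_rate

print(f"potential: {potential_share:.1%}")   # potential: 5.3%
print(f"confirmed: {confirmed_share:.1%}")   # confirmed: 2.7%
```

Neither number gets close to 10% on this reading, so the headline figure presumably comes from a different calculation.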
Yeah this is a pretty strong conclusion to draw from an extremely messy data set. It's probably better to call it something like "10% (+/- 30%) of Firefox crashes are from memory errors that could be bitflips". Small numbers of very flaky hardware could also be a confounding factor, as mentioned in sibling comments.
How about we all switch to ECC? Then we could tell if it's the hardware or the software at fault.
It would be interesting to evaluate the difference. It will be very hard with telemetry because there will be bias in who opts out. But tracking something simple like "has seen a crash in the last 30 days" combined with "has ECC memory" would tell you a great deal. This would also serve as another way to estimate the crash rate associated with ECC memory, so you can compare the two methods of estimation.
Note that they don't claim that 10% of all machines running Firefox have bad memory. The claim is that 10% of all reported crashes are caused by bad memory, which is quite different. Machines with bad memory presumably generate more crash reports.
This is really a testament to the quality of the Firefox codebase. The more bugs they fix, the higher that number will get!
The Hacker News thread also contains very interesting anecdotes: https://news.ycombinator.com/item?id=47252971
I think a lot of those failures are on specific machines. Currently I have some corrupted memory that still works just well enough that I don't want to buy new sticks, but I have probably crashed Firefox 50 times in the last month or so (likely all bitflips).
On Linux, you can exclude zones of memory. You can locate the problematic ones with a memory tester such as memtest and pass them on the kernel command line.
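As a rough sketch of what that can look like: the kernel's `memmap=nn$ss` parameter marks `nn` bytes starting at address `ss` as reserved. The Python helper below, `memmap_args`, is hypothetical and only formats page-aligned arguments from byte ranges a memory tester reported; the 4 KiB page size and the example address are assumptions.

```python
def memmap_args(bad_ranges, page_size=4096):
    """Format (start, end) byte ranges, end exclusive, as memmap=size$start
    kernel arguments, widened to whole pages."""
    args = []
    for start, end in bad_ranges:
        lo = start - start % page_size          # round start down to a page
        hi = -(-end // page_size) * page_size   # round end up to a page
        args.append(f"memmap={hi - lo:#x}${lo:#x}")
    return " ".join(args)


# Example address is made up; use the ones your memory tester reports.
print(memmap_args([(0x18691234, 0x18691238)]))
# memmap=0x1000$0x18691000
```

Depending on the bootloader, the `$` may need escaping in its config file, so check what actually ends up in /proc/cmdline after rebooting.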
A more general solution is an EFI driver that marks the memory as reserved; this works across different operating systems and even when dual-booting. The EFI version of memtest will also skip the reserved ranges and no longer find the errors.
A very simple implementation that I'm currently using: https://github.com/sammko/badram