gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Memory errors in consumer devices such as PCs and phones are not something you hear much about, yet they are probably one of the most common ways these machines fail.

I'll use this thread to explain how this happens, how it affects you and what you can do about it. But I'll also talk about how the industry failed to address it and how we must force them to, for the sake of sustainability. 🧵 1/17

me_,
@me_@sueden.social avatar

@gabrielesvelto Thanks a lot for this thread – we investigated RAM failures some time ago. Looking forward to your article about this!

Online background memtests, remove bad pages from allocation:
Efficient online memory error assessment and circumvention for Linux with RAMpage
https://www.inderscienceonline.com/doi/abs/10.1504/IJCCBS.2013.058397

Type qualifiers to indicate worst-case effect of mem errors:
Improving the fault resilience of an H.264 decoder using static analysis methods
https://dl.acm.org/doi/abs/10.1145/2536747.2536753

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

First of all let's talk briefly about how memory works. What you have in your PC or phone is what we call dynamic random access memory. That is memory that stores bits by putting a minuscule amount of charge into vanishingly small capacitors (or not putting it in if we're storing a zero).

These capacitors continuously leak this charge, so it needs to be refreshed periodically - every few milliseconds - which is why it's called "dynamic". 2/17

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

This design is extremely analog in nature. When your machine needs to read some bits, the capacitors holding them are connected to a bunch of wires. The very small voltage difference that appears on the wire is detected by a circuit that turns it into a clear 0 or 1 value (this is called a sense amplifier). 3/17
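
The mechanism in the last two posts can be sketched as a toy simulation. This is purely illustrative: the leak rate, the 0.5 threshold and the refresh intervals below are invented numbers, not real DRAM parameters.

```python
# Toy model of why this memory is "dynamic": each cell's charge leaks
# away continuously and a periodic refresh rewrites it. All the numbers
# below are invented for illustration.

def read_cell(charge: float) -> int:
    # the sense amplifier's job: resolve an analog level into a clean bit
    return 1 if charge > 0.5 else 0

def simulate(refresh_every_ms: int, total_ms: int,
             leak_per_ms: float = 0.02) -> int:
    charge = 1.0                          # a freshly written "1"
    for t in range(1, total_ms + 1):
        charge -= leak_per_ms             # continuous leakage
        if t % refresh_every_ms == 0 and read_cell(charge):
            charge = 1.0                  # refresh rewrites the stored bit
    return read_cell(charge)

print(simulate(refresh_every_ms=8, total_ms=1000))   # refreshed in time: 1
print(simulate(refresh_every_ms=64, total_ms=1000))  # refreshed too late: 0
```

If the refresh comes often enough the bit survives indefinitely; let the charge decay past the sense threshold even once and the bit is gone.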

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

So how can this fail? In a huge number of ways. Circuits age with time and use. The ability of the individual capacitors to hold the charge goes down slowly over time, the transistors in the sense amplifiers degrade, points of contact oxidize, etc... Past a certain point this can make the whole process end up outside of the thresholds required to reliably read, write and retain the bits in memory. 4/17

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

This can lead to different failures: a very common one is a stuck bit, which ends up being always read as 1 or 0, regardless of what was written into it. Another type is timing-dependent failures, which cause a bit to flip but only if it's not touched in due time by an access or a refresh. More catastrophic errors can affect entire lines - which is what happens when a sense amplifier starts to fail. 5/17
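
The stuck-bit failure mode described above is exactly what the classic all-zeros/all-ones test pattern catches. Below is a minimal Python sketch over an ordinary buffer, with a simulated faulty buffer standing in for bad hardware; a real tester such as memtest86+ has to run against physical addresses with CPU caches out of the way, which a normal process can't do.

```python
# Write all zeros and all ones, read both back, and report any bit that
# didn't take. FaultyRAM simulates a failing chip for demonstration.

def find_stuck_bits(buf) -> list[tuple[int, int]]:
    """Return (byte_offset, bad_bit_mask) pairs found in buf."""
    faults = []
    for pat in (0x00, 0xFF):              # catches stuck-at-1 and stuck-at-0
        for i in range(len(buf)):
            buf[i] = pat
        for i in range(len(buf)):
            diff = buf[i] ^ pat           # any set bit failed to store pat
            if diff:
                faults.append((i, diff))
    return faults

class FaultyRAM(bytearray):
    """Simulated hardware fault: bit 3 of byte 7 always reads as 1."""
    def __getitem__(self, i):
        value = super().__getitem__(i)
        return value | 0x08 if i == 7 else value

print(find_stuck_bits(bytearray(64)))    # healthy: []
print(find_stuck_bits(FaultyRAM(64)))    # [(7, 8)]: offset 7, bit 3 stuck
```

Timing-dependent and line-wide failures need more elaborate patterns and deliberate delays, which is why the dedicated testers run for hours.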

ondra,
@ondra@unextro.net avatar

@gabrielesvelto Also, high energy particles from space affect even brand new RAM chips...
https://www.bbc.com/future/article/20221011-how-space-weather-causes-computer-errors

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@ondra yes that happens, but that's a well understood and well studied phenomenon - especially in the context of data centers. When people talk about bit-flips, that's the first thing that comes to mind. I'd like to change this perception to make people realize that actual hardware faults are a lot more common than cosmic-ray hits.

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Either way, even a single bit error which happens once in a blue moon is catastrophic to a consumer machine. Sometimes it will cause a pixel to slightly change color, but sometimes it will affect an important computation and lead to a crash. Or worse: it'll cause some user data to be corrupted before it's written to disk, and when it is, the damage has become permanent. 6/17

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

If your machine exhibits rare but hard-to-explain crashes, or if you're forced to reinstall programs - or even the operating system - because of mysterious failures, or experience random reboots or BSODs, then it's very likely that your memory is failing and you need to replace it. 7/17

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Diagnosing it is hard. Windows has a memory diagnostic tool which will catch the worst offenders and is easy to use: https://www.microsoft.com/en-us/surface/do-more-with-surface/how-to-use-windows-memory-diagnostic

It's not enough though; some errors can only be caught with more extensive testing. I recommend the open-source memtest86+ (https://memtest.org/) tool or the closed-source memtest86 (https://www.memtest86.com/) one. 8/17

kamstrup,
@kamstrup@fosstodon.org avatar

@gabrielesvelto thanks for an interesting thread!

I read some talk, years ago, about memtest86+ not being maintained and actually not working at all in most cases (IIRC it was in the context of the Ubuntu live CD boot option)

Do you know how it looks these days?

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@kamstrup that was true for a while; the project was then forked into PCMemTest by Martin Whitaker, later merged back into the memtest86+ codebase, and it's now maintained by both the original author (Sam Demeulemeester) and Martin Whitaker. The 7.0 major release dates back to this January.

kamstrup,
@kamstrup@fosstodon.org avatar

@gabrielesvelto awesome, thanks ♥️

dolske,

@gabrielesvelto FWIW, I wrote a "memtest.js" version that runs in the browser. I even got some bad RAM sticks from Mozilla RelEng to verify that it could detect real failures!

Live version still works, too: https://dolske.net/hacks/memtest.js/live/

dolske,

@gabrielesvelto Of course there are some limitations, since JS (thankfully) doesn't have bare-metal access. But I wanted to see if periodically testing whatever chunks of memory the browser/OS gave out would work well enough.

That is, it can't say "all clear", but it can say "problems found". The idea being to eventually have the browser itself run a small background check, which over time should either detect any bad bits or give confidence that things seem OK.
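
The "problems found, never all clear" approach described above can be sketched like this. The chunk size and the pattern function are arbitrary choices made for this sketch, and a real background tester would wait between writing and verifying.

```python
# Fill whatever buffer the allocator hands out with a position-dependent
# pattern, then re-read and compare. This can never prove the machine is
# healthy (the OS decides which physical pages back the buffer), but any
# mismatch it finds is a genuine fault.

def pattern(i: int) -> int:
    # cheap per-offset pattern, so no reference copy needs to sit in RAM
    return (i * 0x9E3779B1 >> 16) & 0xFF

def scan_chunk(size: int = 1 << 20) -> list[int]:
    """Return the offsets that read back wrong (empty = nothing found)."""
    buf = bytearray(size)
    for i in range(size):
        buf[i] = pattern(i)
    # a real background tester would sleep here, letting refresh cycles
    # and neighbouring memory traffic act on the cells before verifying
    return [i for i in range(size) if buf[i] != pattern(i)]

print(scan_chunk(4096))  # healthy memory: []
```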

dolske,

@gabrielesvelto Alas it was just a side project for fun, so I set it aside after the proof-of-concept.

It seems like an interesting problem space, so I hope you get good results!

Oh, the old code: https://github.com/dolske/memtest.js

(It was also an excuse to play with the then-new asm.js, for the hot bitwise-op loops. That code is lost, but IIRC it wasn't any faster because whatever JIT we used then already did a good job.)

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@dolske thanks Justin! With @pbone we were thinking of doing just that, statistically testing crashy machines to figure out if we could spot some bad memory.

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Naturally what happens on PCs also happens on phones, network devices, printers, TVs, etc... but you can't test them. This is a disaster because these failures are common, and they become more and more common as the device ages. If we want to have repairable devices that last for a long time, the industry will have to change its practices, but more about this later. 9/17

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Now you might wonder: how often does this actually happen? The common wisdom on this topic is that hardware failures are so rare that software bugs will always dwarf them. As I found out this is demonstrably false.

While investigating Firefox crashes I've come to the conclusion that several of the most common issues we were dealing with were likely caused by flaky hardware. This led me to come up with a simple heuristic to detect crashes potentially caused by bit-flips. 10/17
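
The thread doesn't give the details of Mozilla's classifier, so what follows is only a plausible sketch of the idea: flag a crash as a likely bit-flip when the faulting address is exactly one flipped bit away from a known-good value. The 0xE5-byte poison pattern is used here as an assumed example of such a value.

```python
def differs_by_one_bit(a: int, b: int) -> bool:
    """True iff a and b differ in exactly one bit position."""
    d = a ^ b
    return d != 0 and d & (d - 1) == 0    # non-zero power of two

POISON = 0xE5E5E5E5                       # assumed freed-memory fill value

def looks_like_bit_flip(fault_addr: int) -> bool:
    # crashing on "poison plus one flipped bit" suggests the pointer was
    # corrupted in RAM rather than by a logic bug
    return differs_by_one_bit(fault_addr, POISON)

print(looks_like_bit_flip(POISON ^ (1 << 17)))  # True: one bit away
print(looks_like_bit_flip(0xDEADBEEF))          # False: many bits differ
```

The same one-bit-distance test can be run against any expected constant or valid pointer recovered from a crash dump.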

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Deploying this heuristic to Mozilla's crash reporting infrastructure has been eye opening: if I take the 10 most common crashes on Windows, 7 are out-of-memory conditions - that is, not bugs - and 3 are likely caused by bad memory.

You read that right: three of the ten most common reasons why Firefox crashes on Windows are caused by memory that's gone bad. 11/17

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Now there are a few things worth mentioning: users with bad hardware will be over-represented in this category, since their machines crash far more often than others'.

The second thing is that Firefox is exceptionally stable; we've driven down its crash rate by more than 70% in the last few years. But Firefox is also a 30-million-lines-of-code monster. There are bugs in there, but they're less common than hardware failures! 12/17

ali1234,
@ali1234@mastodon.social avatar

@gabrielesvelto

How can memory errors cause the same crash on two different computers unless they both have the same error at the same address? (Which is of course astronomically unlikely.)

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@ali1234 because statistically they happen more frequently in code that touches a lot of memory. Firefox's JavaScript garbage collector is one such example: during a typical mark & sweep pass it traverses the heap, touching thousands upon thousands of objects and crossing an enormous number of pointers. Because it's far more likely to hit a bad bit than the rest of the code, it shows up far more often in crash reports. The same goes for code that traverses huge hash tables, etc...

ali1234,
@ali1234@mastodon.social avatar

@gabrielesvelto

I see. A similar phenomenon was observed in initializing bitcoin full nodes, where verifying the full chain (currently 5TB and growing) exposed a lot of memory errors due to hash mismatches.

That happened essentially because everyone was doing the exact same calculations on the exact same data, in the exact same order, and expecting a known result.

I would have expected a lot more randomness in Firefox, but I forgot how much manual memory management it does.

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@ali1234 this is very interesting, yeah that's another workload which ends up touching a ton of memory

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Plotting these types of crashes against time yields interesting trends: the more machines age the more likely they are to encounter hardware-related failures. You might think that's obvious, and indeed it is, but until now the industry has looked the other way, based on the hand-wavy excuse that hardware failures were less common than bugs. 13/17

gabrielesvelto, (edited )
@gabrielesvelto@fosstodon.org avatar

So what needs to change? First of all, error detection and correction must become commonplace. You can already build a desktop machine with ECC memory (https://en.wikipedia.org/wiki/ECC_memory), but it's uncommon in laptops, even mobile workstations, and completely absent on phones and other consumer appliances. This will measurably lengthen the usable life of these devices. 14/17
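
For readers curious what "error correction" means concretely: real ECC DIMMs use a SECDED code over 64-bit words (8 check bits per word, with the extra parity bit there to *detect* double errors), but the principle is the same as this toy Hamming(7,4) code, which protects 4 data bits with 3 parity bits and can locate and fix any single flipped bit.

```python
def hamming7_encode(data: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d = [(data >> i) & 1 for i in range(4)]
    code = [0] * 8                           # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions 4,5,6,7
    return sum(bit << (pos - 1) for pos, bit in enumerate(code[1:], start=1))

def hamming7_correct(word: int) -> tuple[int, int]:
    """Return (corrected data bits, error position; 0 = no error seen)."""
    code = [0] + [(word >> i) & 1 for i in range(7)]
    s1 = code[1] ^ code[3] ^ code[5] ^ code[7]
    s2 = code[2] ^ code[3] ^ code[6] ^ code[7]
    s4 = code[4] ^ code[5] ^ code[6] ^ code[7]
    syndrome = s1 + 2 * s2 + 4 * s4      # reads out the bad bit's position
    if syndrome:
        code[syndrome] ^= 1              # flip it back
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return data, syndrome

word = hamming7_encode(0b1011)
corrupted = word ^ (1 << 4)              # a fault flips codeword position 5
print(hamming7_correct(corrupted))       # (11, 5): data recovered intact
```

The elegant part is that the three parity checks, read together as a binary number, spell out the position of the bad bit; a non-zero syndrome is exactly the detection signal the next post argues every device should surface to the user.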

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Note that detection is more important than correction. The user needs to know that there's something wrong without having to run a memory testing program. Think of the lights that turn on in cars if something's malfunctioning, or the error beeps that your washing machine makes when it thinks it's leaking water. These are extremely common, they need to be on computing devices too. 15/17

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

Finally hardware design must change to make devices repairable and prolong their useful life. Yes, I'm looking at non-ECC memories soldered on the motherboard or worse, on the same substrate as the CPU. 16/17

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

To end the thread I'd like to thank my colleagues Alex Franchuk and @willcage who did the implementation work and my boss Gian-Carlo Pascutto who plotted crashes against machine age. I'd also like to point out that we've got preliminary data on the topic, but I fully intend to write a proper article with a detailed analysis of the data. 17/17

fahrni,
@fahrni@curmudgeon.cafe avatar

@gabrielesvelto @willcage Looking forward to your article, Gabriele.

osma,
@osma@mas.to avatar

@gabrielesvelto
Thanks for an interesting thread! It brought back memories from 90s computing that was full of repeatedly running memory checkers and reseating chips - and how that taught me for a while to only accept motherboards which supported ECC memory. Indeed no longer an option available for our devices.

ravenonthill,
@ravenonthill@mastodon.social avatar

@gabrielesvelto @willcage IBM designed parity checks everywhere in the System 360, so that the system would quickly stop in the event of hardware failure. ECC was implemented in main memory systems when it was discovered that solid state RAM was subject to transient memory failures from cosmic rays (really.) Early PCs used memory parity checks (which weren't adequate) until Apple abandoned them for cost reasons. Bad mistake.

https://www.bbc.com/future/article/20221011-how-space-weather-causes-computer-errors

tommythorn,
@tommythorn@chaos.social avatar

@gabrielesvelto Working with existing non-ECC systems, couldn't some of this be caught if the OS ran a low-priority process scrubbing memory, e.g. writing and checking a random but checksummed pattern to free pages (similar to ZFS scrubbing)? Even better, important data structures should be checksummed (something I actually did in a database engine I wrote many decades ago).
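
The second suggestion above - checksumming important data structures - can be sketched in a few lines: keep a CRC next to each record and verify it on every read, so a flipped bit surfaces as a loud error instead of silently corrupting downstream state. The record layout (4-byte little-endian CRC + payload) is made up for this example.

```python
import zlib

def seal(payload: bytes) -> bytes:
    """Prepend a CRC32 of the payload."""
    return zlib.crc32(payload).to_bytes(4, "little") + payload

def unseal(record: bytes) -> bytes:
    """Verify the CRC and return the payload, or raise on corruption."""
    crc = int.from_bytes(record[:4], "little")
    payload = record[4:]
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch: record corrupted in memory")
    return payload

rec = seal(b"account=42;balance=100")
assert unseal(rec) == b"account=42;balance=100"

corrupted = bytearray(rec)
corrupted[10] ^= 0x04                     # simulate a single bit flip
# unseal(bytes(corrupted)) now raises ValueError instead of returning
# silently corrupted data
```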

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@tommythorn one of the things we plan to do is to run a brief, low-priority scan in processes that crashed. If we find some bad memory we'll notify the user and flag future crashes from their machine as low-value so we don't spend too much time looking at them.

tomayac,
@tomayac@toot.cafe avatar

@gabrielesvelto Maybe saving someone else a Web search: Error correction code memory (ECC memory) is a type of computer data storage that uses an error correction code (ECC) to detect and correct n-bit data corruption which occurs in memory: https://en.wikipedia.org/wiki/ECC_memory.

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@tomayac good point, I'll add the link to the post

cr1901,
@cr1901@mastodon.social avatar

@gabrielesvelto Really depressing that we've reached the physical limits of creating "memory we're confident will actually store its value reliably" :(.

We've gone from PARITY CHECK 1/2 to "memory works fine without detection or correction" to "oh, now not even a parity check is enough". In that sense, it's WORSE than 40 years ago :P.

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@cr1901 yes, it is worse than 40 years ago! This is an area where we've actively regressed

jbqueru,
@jbqueru@fosstodon.org avatar

@gabrielesvelto @cr1901 maybe that's why it sometimes felt like those old machines were rock-solid in spite of their limitations: hardware has become less reliable faster than software became more reliable.

qqmrichter,
@qqmrichter@mastodon.world avatar

@gabrielesvelto I question whether "out of memory conditions" should be classified as "not a bug". Given how memory-hungry Firefox is, and how jealously it clings to memory it's been allocated by the OS (I had to install a third-party extension, Auto Tab Discard, to help combat this!), I would say that Firefox's thirst for memory is indeed 100% a bug.

Of course if you're on-side with people being forced to continually update hardware...

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@qqmrichter Firefox has limited control over what web pages do. When memory usage gets out of hand it's usually the page's fault. When I said that Firefox is exceptionally stable I meant it: we dramatically cut OOMs just a couple of years ago, which is why the hardware issues are now even visible: https://hacks.mozilla.org/2022/11/improving-firefox-stability-with-this-one-weird-trick/

qqmrichter,
@qqmrichter@mastodon.world avatar

@gabrielesvelto You can choose, however, whether to degrade gracefully ("I'm sorry, this web page won't load because of memory issues") instead of crashing in a shower of bits (which I'll admit you're doing a bit better on now).

I mean worst comes to worst, have an LRU on off-screen assets and just drop them in order when you try to allocate memory and it fails. If they're needed again, reload them again. It's ludicrous that a web browser requires (WAY!) more space than a whole OS image.

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@qqmrichter we already do that. We do a lot of stuff when memory is tight. We purge caches, trim buffers, unload tabs that haven't been used in a while, do aggressive garbage collection, etc... Sometimes there's just no way to avoid an OOM, especially in the case of pages with runaway memory consumption or if other applications are also using resources on the machine.

ollibaba,
@ollibaba@chaos.social avatar

@gabrielesvelto @qqmrichter Sorry to hijack this thread; but since you appear to know a lot about this topic: is it possible (or even common) for websites to have memory leaks, e.g. in their JS code? And if so, do you have links to possible solutions (for me, as end-user)?

I suspect that some of the tabs I keep open for long time use up more and more memory (and they have lots of JS code); reloading the tabs appears to help, but I'm looking for an automatic solution.

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@ollibaba @qqmrichter yes, runaway memory consumption in websites is common. See this for example: https://nolanlawson.com/2021/12/17/introducing-fuite-a-tool-for-finding-memory-leaks-in-web-apps/

Firefox already does a fairly good job at keeping these pages at bay, but if you need something more active there's addons which can be used to unload unused tabs to free resources: https://addons.mozilla.org/en-US/firefox/addon/tab-reloader/?utm_source=addons.mozilla.org&utm_medium=referral&utm_content=search

lienrag,

@gabrielesvelto

Hi and thanks for your work.
Could there be an alert (I guess in the "about:performance" tab) for pages with runaway memory consumption?
It already shows the memory consumption, but it's not clear when a page simply needs memory to do its work and when it's running amok.

Also, very empirically, I believe that plug-ins can sometimes provoke runaway problems, and there's no real tool to monitor that.

@qqmrichter

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@lienrag @qqmrichter it's hard to tell apart a page that uses a lot of memory from a page that's leaking. What we're doing though is making sure that pages which use a lot of memory and which you aren't looking at get unloaded eventually, freeing the memory they're holding on to.

rlb,

@gabrielesvelto I appreciate your focus on improving diagnosis. Those over-represented crashes distract people from real bugs. I don't understand why you included OOMs though -- allocation failures can and will happen, and it's our responsibility to handle them gracefully.

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@rlb we've reduced OOM crashes massively in the past few years: https://hacks.mozilla.org/2022/11/improving-firefox-stability-with-this-one-weird-trick/

We already gracefully handle all the failures that we can realistically handle, but for a lot of them there's nothing we can do. That's especially true on non-Windows platforms, where allocations never fail. On both Linux and Android the kernel will kill processes to save memory without informing them or allowing for any type of reaction, so graceful handling is impossible.

promovicz,
@promovicz@chaos.social avatar

@gabrielesvelto I believe that most bitflips happen in media files. That way, they go unnoticed. ECC becomes important pretty quickly when doing other things…

lolcat,
@lolcat@digipres.club avatar

@gabrielesvelto

Peripherally related: DRAM consumes considerable power whether you're actively using it or not. Are you aware of any research into disabling refresh on unused memory? Could reducing refreshes also extend DRAM's useful life and reduce errors?

True, with modern cache design, pretty much all the RAM is used all the time, but "used" is a pretty fuzzy concept in this context. Moreover, swapping to SSD isn't anywhere near as expensive as swapping to a SATA disk.

gabrielesvelto,
@gabrielesvelto@fosstodon.org avatar

@lolcat IIRC modern DRAMs have low-power modes which can be enabled when utilization is low or zero; I don't know whether they can also shut down individual sections/banks, but maybe they do? I remember reading about that as a theoretical possibility well over a decade ago, so maybe yes
