Computer Device Makers Ignore Bit-Flip Errors and Data Corruption

Binary code with one bit being flipped

I came across this article at Spiceworks noting a new type of URL typo-squatting based on bits being flipped in memory. However, the article seemed to bury the lead, which is this:

Research has shown that a computer with 4GB of memory has a 96% chance of having a random “bit flip” every three days.

That’s a crazy high chance of data corruption occurring on your computer. So, what causes these bit-flip errors? Well, as circuits in computers get smaller and smaller (e.g., the latest Apple chips are based on impossibly fine 5nm circuits, and memory circuits have also shrunk), there is an increased chance that a cosmic ray, a neutron, or some other interference passing through them will flip a 0 to a 1 or vice versa.
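To make the 0-to-1 idea concrete, here is a minimal Python sketch of what a single flipped bit does to a hostname held in memory, which is the trick behind the typo-squatting mentioned above. The function name `one_bit_flips` and the rule for what counts as a valid hostname character are my own simplifications, not anything from the Spiceworks article.

```python
def one_bit_flips(name: str) -> list[str]:
    """Return every variant of `name` that differs by exactly one flipped bit
    and still consists of lowercase letters, digits, or hyphens.
    Dots are left alone so the label structure stays intact (a simplification)."""
    allowed = set("abcdefghijklmnopqrstuvwxyz0123456789-")
    data = name.encode("ascii")
    variants = []
    for i, byte in enumerate(data):
        if chr(byte) == ".":
            continue
        for bit in range(8):
            flipped = byte ^ (1 << bit)  # XOR toggles exactly one bit
            ch = chr(flipped)
            if ch in allowed:
                variants.append(name[:i] + ch + name[i + 1:])
    return variants


# 'a' is 01100001 in ASCII; toggling bit 1 gives 01100011, which is 'c'.
print(format(ord("a"), "08b"), "->", format(ord("a") ^ 0b10, "08b"))
print(one_bit_flips("a"))
```

Each variant is only one cosmic ray away from the original name, which is why someone registering those look-alike domains can occasionally catch real traffic.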

This is why devices are ‘radiation-hardened’ for space applications. Hardening includes, in part, increasing the size of circuits. Chip fabrication for space applications is generally held between 65nm and 150nm (a staggering 13x to 30x larger than current 5nm circuits), because cosmic rays are much more likely to pass through devices in space than on the surface of the Earth.

Here on Earth we have an easier way to deal with such random bit flips, and it’s called ECC memory. ECC stands for Error Correction Code, and it employs extra check bits, a more elaborate form of parity, to detect and correct such bit-flip errors. Parity is used, for example, by network storage devices like Synology, e.g. with RAID 5, to let you replace a bad drive in your RAID without losing your data (so why don’t they use it with RAM?). Currently, the only Apple product that employs ECC memory is the Mac Pro. The question is why?
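For anyone curious how parity lets a RAID 5 array survive a dead drive, here is a rough Python sketch. The three "drives" and their contents are made up for illustration, and real RAID 5 rotates the parity across drives, which this ignores.

```python
from functools import reduce


def xor_parity(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data drives holding two bytes each.
data = [b"\x0f\xa0", b"\x33\x55", b"\xc1\x7e"]
parity = xor_parity(data)

# Drive 1 dies. XOR-ing the survivors with the parity block rebuilds it.
rebuilt = xor_parity([data[0], data[2], parity])
print(rebuilt == data[1])


def parity_bit(byte: int) -> int:
    """Single even-parity bit: 1 if the byte has an odd number of 1 bits."""
    return bin(byte).count("1") % 2


# A lone parity bit notices a flip in RAM but cannot say which bit flipped.
original = 0b10110010
corrupted = original ^ (1 << 3)
print(parity_bit(original), parity_bit(corrupted))
```

The rebuild works because the array knows which drive is missing. A random flip in RAM gives no such hint, so plain parity can only detect the error, and ECC has to spend extra check bits to locate it as well.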

Modern devices seem likely to flip a bit and corrupt your data almost every day. The problem will only get worse with more memory and smaller fabrication techniques. That means every day your computer may bomb inexplicably, or some bit of data on your computer will get corrupted. And that data corruption can compound, getting worse and worse over time.
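Here is a back-of-the-envelope way to turn the quoted "96% in three days for 4GB" figure into a per-day number. It is only a sketch under two assumptions of mine: that flips arrive independently at a constant rate (a Poisson process), and that the rate scales linearly with the amount of memory. Neither comes from the quoted research.

```python
import math

P_THREE_DAYS = 0.96  # quoted chance of at least one flip in three days
BASE_GB = 4          # memory size the quoted figure refers to


def flips_per_day(p_window: float, window_days: float) -> float:
    """Invert P(at least one event in t days) = 1 - exp(-rate * t)."""
    return -math.log(1.0 - p_window) / window_days


def p_at_least_one(rate_per_day: float, days: float = 1.0) -> float:
    """Chance of at least one flip in `days` at the given average rate."""
    return 1.0 - math.exp(-rate_per_day * days)


rate_4gb = flips_per_day(P_THREE_DAYS, 3)
for gb in (4, 8, 16, 32):
    rate = rate_4gb * gb / BASE_GB  # assumed linear scaling with memory size
    print(f"{gb:>2} GB: ~{rate:.2f} flips/day, "
          f"P(at least one flip in a day) = {p_at_least_one(rate):.1%}")
```

Under those assumptions, 4GB works out to about one flip per day on average and roughly a two-in-three chance of at least one flip on any given day, and the odds climb quickly as memory grows.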

So why don’t all modern computer and mobile device makers use ECC memory? Right now ECC memory costs a bit more: an ECC module stores 72 bits for every 64 bits of data, in effect a 9th bit for every 8, and uses those extra bits as check bits that can locate and correct a single flipped bit (a lone parity bit per byte could only detect one). However, if everyone moved to ECC memory as a default, these prices would fall fast.
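To show how a handful of check bits can pinpoint a flipped bit, here is a toy Hamming(7,4) code in Python. It is a teaching sketch, not what a memory controller actually runs; the function names are mine, and real ECC memory applies the same idea to 64-bit words.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def encode(d: list[int]) -> list[int]:
    """Encode 4 data bits as a 7-bit codeword laid out in positions 1..7."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def correct(code: list[int]) -> tuple[list[int], int]:
    """Fix at most one flipped bit. Returns (data bits, error position or 0)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # the syndrome spells out the bad position
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], pos


word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1  # simulate a stray bit flip at position 5
print(correct(stored))

# 64 data bits need 7 check bits, plus 1 overall parity bit for
# double-error detection: 64 + 8 = 72 bits.
print(hamming_check_bits(64) + 1)
```

Protecting each byte on its own would need 4 check bits per 8 data bits, which is why ECC works on wider words where the overhead drops to one extra bit in nine.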

I guess my question is, with error rates so high that a Mario 64 speed runner is experiencing them, is it at some point negligent for our computer/device makers to not start using ECC memory?

9 thoughts on “Computer Device Makers Ignore Bit-Flip Errors and Data Corruption”

  • Another good gesture would be “squeeze the pencil”: it would begin dictation, and text could pour out from the pencil tip onto the page where you have the pencil.

  • A little background is appropriate here.

    ECC requires support within the CPU too in order to do something about it. Intel CPUs have this ability BUT it’s only enabled on Xeons, as far as I know. The desktop-class CPUs (i3/i5/i7/i9 etc) don’t have this. Some people claim that the silicon does have it but the capability is not exposed at the pin-out. I have no info on that.

    So this is why Mac Pros support ECC memory and others do not – the Mac Pros use Xeon CPUs. Xeons also have one additional differentiator – they have the signaling necessary to synchronize CPU/memory caches for multi-processor systems, such as “big iron” server machines. You can’t do that with desktop-class CPUs (cheaper) because that sync ability isn’t exposed. So Intel can sell its desktop CPUs at a competitive price and then charge big markups for essentially the same chips that have (1)ECC, and (2)multi-CPU ability available.

    So the question now is why did Apple not opt to include ECC in its “M” series CPUs?

    We don’t know but that would be a great question to ask.

  • Considering how essential computer devices and systems are for our everyday lives, yes manufacturers should go to ECC memory. This isn’t 1987 where they were just hobby devices any more.

    1. Also back then, computer chips used probably 300nm+ lithography, and as such, the chips were more hardened against these bit flips. So you got a bit flip maybe once every few years, if that!

  • John:
     
    Agreed, at least in principle.
     
    ECC is not only more expensive, but unlike parity RAM, will come at a performance cost relative to non-parity and logic parity RAM https://en.wikipedia.org/wiki/RAM_parity?wprov=sfti1
     
    An outstanding question is, with the reduction in circuit size, has anyone noticed a slip in memory performance?
     
    Without a consumer demand groundswell, the industry writ large is unlikely to voluntarily migrate to ECC. If Apple can engineer a way to minimise this such that the hit is negligible, that could become an industry driver, assuming Apple take the lead. If this becomes a practical problem, Apple might do just that.
     
    Nice pick up.

    1. I have ECC memory on my Mac Pro with 28 cores… Let’s just say if there is a performance hit, it is negligible. Having a bit flipped once a day over years, IMO, will not remain negligible. That said, fair and good-minded folks can disagree! As always, YMMV.
