danluu,
@danluu@mastodon.social

This exchange reminds me of the debate I had with Jeff Atwood a decade ago on whether or not servers should use ECC memory. Jeff said no; I disagreed and said yes in https://danluu.com/why-ecc/.

At the time, there was one argument that could've, theoretically, been overturned by progress: Jeff argued that commodity non-ECC memory was already highly reliable and was becoming more so. This was not true at the time, and it turns out it still isn't true a decade later.

Elucidating,
@Elucidating@mastodon.social

@danluu Hey, just a fun fact: at Google we recently had a truly amazing moment where non-ECC RAM holding part of a protocol buffer got corrupted after transmission and validation, got replicated all over the place, and caused chaos.

I do not know how we can do defense in depth much harder, but we'll have to figure out a way, because I have officially lived to see a cosmic ray infiltrate and destroy a distributed system that aggressively validates input off the wire on all nodes.

danhulton,
@danhulton@hachyderm.io

@Elucidating @joby @danluu That is wild, but what is more wild is that you know that that's what happened, that there was somehow the tooling in place to understand and prove this to be the case.

I've never worked at a FAANG, I was unaware you all employed actual mystical sorcery in your stack. Is that something you can import with Terraform or do you need an actual mystical wizard with an ancient tome on-staff? 😆

gryzor,
@gryzor@androiddev.social

@danluu I wonder if Jeff @codinghorror still thinks the same. ;)

slink,
@slink@fosstodon.org

@danluu if anyone else prefers links over screenshots: https://fosstodon.org/@gabrielesvelto/112401366345455768

danluu,
@danluu@mastodon.social

Maybe a more relevant piece of data is the rate of ECC errors in servers with ECC memory. But, as discussed in the post, my experience with consumer vs. server CPUs, RAM, etc., is that server hardware is more reliable than consumer hardware, so extrapolating from the observed error rate in ECC memory won't really get you to the true error rate.

But even so, from the server data I've seen, which would be overly optimistic, I would not want to have servers without ECC.

mhoye,
@mhoye@mastodon.social

@danluu one of the biggest things server hardware enjoys is clean power and controlled operating conditions. You’re certainly not wrong about ECC mattering, but consumer-grade hardware running on server-room power in temperature- and humidity-controlled rooms runs far more reliably, and for far longer, than you’d ever expect, and most house voltage in North America is hardware-stressing noise by comparison.
