shortridge,
@shortridge@hachyderm.io avatar

in case there are other nerds out there who haven’t yet read this classic, behold “the case of the 500-mile email” https://www.ibiblio.org/harris/500milemail.html

I adore the “absurd computer-borne mysteries” genre and kindly ask for more content from the annals of y’all’s careers

bob_zim,

@shortridge While working tech support, I got a call on a Monday. Some VPNs which had been working on Friday were no longer working. After a little digging, we found the negotiation was failing due to a certificate validation failure.

The certificate validation failure was happening because the system couldn’t check the CRL.

The system couldn’t check the CRL because it was too big. The system doing the validation only allocated 512kB to store the CRL, and it was bigger than that. This is from a private certificate authority, though, and 512kB is a LOT of revoked certificates. Shouldn’t be possible for this environment to hit within a human lifespan.

Turns out the CRL was nearly a megabyte! What gives? We check the certificate authority, and it’s revoking and reissuing every single certificate it has signed once per second.

The revocations say all the certificates (including the certificate authority’s) are expired. We check the expiration date of the certificate authority, and it’s set to some time in 1910. What? It was around here I started to suspect what had happened.

The certificate authority isn’t valid before some time in 2037. It was waking up every second, seeing the current date was after the expiration date and reissuing everything. But time is linear, so it doesn’t make sense to reissue an expired certificate with an earlier not-valid-before date, so it reissued all the certs with the same dates and went to sleep. One second later, it woke up and did the whole process over again. But why the clearly invalid dates on the CA?
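A toy model of that loop, as a rough sketch only: the function name, the certificate count, and the dates are made up for illustration, and nothing is known about the real CA's code beyond the description above. It just shows why reissuing with unchanged dates adds one revocation per certificate on every wake-up.

```python
from datetime import datetime, timedelta, timezone


def count_revocations(cert_count, not_after, start, ticks):
    """Count CRL entries added while the CA wakes up once per second.

    Hypothetical model: every certificate shares the same not_after, and a
    reissue keeps the old dates, so an expired certificate stays expired.
    """
    crl_entries = 0
    now = start
    for _ in range(ticks):
        if now > not_after:
            # Every certificate looks expired: revoke all of them and
            # reissue with the same dates, so the next tick repeats this.
            crl_entries += cert_count
        now += timedelta(seconds=1)
    return crl_entries


start = datetime(2024, 1, 6, tzinfo=timezone.utc)         # made-up "now"
expired = datetime(1910, 12, 9, tzinfo=timezone.utc)      # made-up expiry
one_hour = count_revocations(50, expired, start, ticks=3600)
```

With these made-up numbers, 50 certificates revoked once per second for an hour is 180,000 CRL entries, which is how a small private CA can outgrow a fixed CRL buffer over a weekend.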

The CA operation log was packed with revocations and reissues, but I eventually found the reissues which changed the validity dates of the CA’s certificate. Sure enough, it reissued itself in 2037 and the expiration date was set to 2037 plus ten years, which fell victim to the 2038 limitation. But it’s not 2037, so why did the system think it was?
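For anyone who wants to see the wraparound, here is a minimal sketch assuming the CA stored expiry as a signed 32-bit count of seconds since 1970. The issue date below is a made-up example in early 2037; the actual date and the CA's real storage format are not stated in the story.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def wrap_int32(value):
    """Reduce an integer to the range of a signed 32-bit value."""
    return (value + 2**31) % 2**32 - 2**31


def as_time32(moment):
    """Round-trip a datetime through a signed 32-bit seconds counter."""
    seconds = int((moment - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=wrap_int32(seconds))


issued = datetime(2037, 1, 15, tzinfo=timezone.utc)  # hypothetical issue date
expires = issued.replace(year=issued.year + 10)      # "2037 plus ten years"
stored_expiry = as_time32(expires)                   # what a 32-bit field keeps
```

The issue date still fits in 32 bits, but the expiry does not: `stored_expiry` lands in 1910, well before `issued`, which gives a certificate that expires before it becomes valid.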

The OS running the CA was set to sync with NTP every 120 seconds, and it used a really bad NTP client which blindly set the time to whatever the NTP server gave it. No sanity checking, no drifting. Just get the time, set the time. OS logs showed most of the time, the clock adjustment was a fraction of a second. Then some time on Saturday, there was an adjustment of tens of thousands of seconds forward. The next adjustment was hundreds of thousands of seconds forward. Tens of millions of seconds forward. Eventually it hit billions of seconds backwards, taking the system clock back to 1904 or so. The NTP server was racing forward through the 32-bit timestamp space.
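The missing sanity check can be sketched in a few lines. This is only an illustration of the idea: the function name and both thresholds are assumptions, not the behavior of any particular NTP client.

```python
def classify_adjustment(offset_seconds, step_threshold=0.128, panic_threshold=1000.0):
    """Decide what to do with a clock offset reported by a time server.

    Thresholds are illustrative assumptions:
      - small offsets are slewed (the clock is sped up or slowed down gradually),
      - moderate offsets are stepped (the clock is set directly),
      - huge offsets are rejected so one bad server cannot move the clock by years.
    """
    magnitude = abs(offset_seconds)
    if magnitude > panic_threshold:
        return "reject"
    if magnitude > step_threshold:
        return "step"
    return "slew"


typical = classify_adjustment(0.004)           # a fraction of a second
saturday = classify_adjustment(40_000)         # tens of thousands of seconds
wrapped = classify_adjustment(-4_000_000_000)  # billions of seconds backwards
```

A client with a check like this would have refused the first large jump on Saturday and the CA would never have seen 2037.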

At some point, the NTP server handed out a date in 2037 which was after the CA’s expiration. It reissued itself as I described above, and a date math bug resulted in a cert which expired before it was valid. So now we have an explanation for the CRL being so huge. On to the NTP server!

Turns out they had an NTP “appliance” with a radio clock (e.g., a CDMA radio, GPS receiver, etc.). Whoever built it had done so in a really questionable way. It seems it had a faulty internal clock which was very fast. If it lost upstream time for a while, then reacquired it after the internal clock had accumulated a whole extra second, the server didn’t let itself step backwards or extend the duration of a second. The math it used to correct its internal clock somehow resulted in dramatically shortening the duration of a second until it wrapped in 2038 and eventually ended up at the correct time.

Ultimately found three issues:
• An OS with an overly-simplistic NTP client
• A certificate authority with a bad date math system
• An NTP server with design issues and bad hardware

bob_zim,

@shortridge Some time later, I was no longer working tech support. I got hired to do network and firewall stuff for a fairly large company. At one point, they decided to relocate the office where a lot of the operations and monitoring staff worked. They moved the whole application monitoring team to the new building with the unproven infrastructure first, because some people in charge made very bad decisions.

The monitoring team gets to the new building, and they can’t access any of their monitoring systems. Clearly a problem with the new office, right? They go through a few environments to get to their monitoring systems, so I log in to the remote access VPN for the first one and confirm the first firewall they hit sees their traffic and isn’t dropping it.

I go to log in to the remote access VPN for the second environment, where the monitoring systems actually live. I’m able to start the connection, but it never prompts me for my credentials, and the tunnel never comes up. Huh. That’s weird.

Well, I’ll just get in through the DR version of the second environment. Connection works and it prompts me for my credentials, but it rejects them. I try again, in case I made a mistake entering the passphrase for my key, but it’s still rejected. Huh. That’s weird.

I eventually find a working way in. I’m able to ping all the relevant systems, I’m able to make TCP connections via telnet, but trying to actually use a service like SSH or MSRDP just hangs. But wait! I can connect to my firewalls via SSH! So what’s common among the broken systems?

All the broken systems are VMs. I start testing connections to other things which I know are VMs. They all behave the same. Ping works, TCP connections work, but data over the connections gets no response.

I bring in the virtualization team. Some of us drive in to the datacenter hosting the VMs giving us trouble. Someone quickly realizes the single SAN hosting all of the VMs’ drives was up, but wasn’t responding to storage requests. Effectively the drive had been pulled out of every single VM. Now we have an explanation for why all the VMs seem to be broken.

With most operating systems, the network stack is wired in RAM and can’t be swapped out. The network stack handles responding to pings and opening TCP connections on listening ports. Once a TCP connection is opened, it requests a copy of the listening service from storage to handle the connection. With storage no longer responding, the network stack never gets the copy of the service to handle the connection, so data doesn’t work.

Why couldn’t I connect to the second VPN endpoint? Well, some people in charge made very bad decisions. They had decided that since VMs are the future, the VPN endpoints in that facility should be moved from dedicated hardware to VMs stored on the SAN. They hadn’t gotten to the first VPN endpoint yet, but that environment wasn’t allowed to connect in to the second environment.

Okay, but I could connect to the other site’s VPN endpoint, and the other site didn’t have any problems. Why didn’t it accept my credentials? Well, some people in charge made very bad decisions (you may be noticing a theme!). All authentication was run through some VMs which were stored on the SAN. The VPN boxes in the working location were set to monitor the health of the authentication boxes in the failed location by pinging them. As long as they responded to ping, they were good, so the VPN boxes wouldn’t fail over to using their local authentication boxes. And a computer with its drive pulled can still respond to ping with just the network stack in RAM.

Once we realized what was going on, we physically connected to the WAN routers and added routes to prevent the two sites from reaching each other’s authentication boxes. Presto! We could now log in via the DR environment as normal. The other infrastructure teams were then able to start digging into their parts.

But why is the SAN unresponsive? Turns out this particular SAN vendor had an option for what to do under certain failure conditions: it could fail read-only or fail completely silent. This one was set to fail silent, and it had filled up.

I wasn’t directly involved in fixing the SAN. I know the manager over the SAN team had been sounding the alarm for months before it filled. I also know there were multiple levels of bad configuration, such as more space offered by LUNs than the SAN could physically provide.

Big takeaways:

  1. Make sure your access to fix a system doesn’t depend on that system. It’s really easy to accidentally introduce dependency cycles, and it takes constant work to avoid them.
  2. Superficial tests like whether you can ping something can’t detect some pretty major failures. More significant tests are more likely to notice the problem.
  3. When something is critical to an environment, maybe have more than one of them? The SAN had internal redundancy to deal with faulty drives and so on, but all the storage was in one giant pool. Multiple SAN systems can provide a bulkhead such that breaking one would not break all VMs.
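To make takeaway 2 concrete, here is a minimal sketch of a check that goes one step past ping and a bare TCP connect. It assumes the service speaks first after the connection opens (SSH does, with its version banner); the function name and timeout are illustrative, and a protocol where the client speaks first would need a real request instead.

```python
import socket


def service_responds(host, port, timeout=2.0):
    """Return True only if the service sends at least one byte after connect.

    A host whose storage has vanished can still complete the TCP handshake,
    so a successful connect alone proves very little.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            return len(conn.recv(1)) > 0
    except OSError:
        # Covers refused connections, unreachable hosts, and read timeouts.
        return False
```

Pointing a failover monitor at something like `service_responds("auth-box.example", 22)` (a placeholder host) instead of ping would have marked the authentication boxes as down once they stopped answering.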
shortridge,
@shortridge@hachyderm.io avatar

@bob_zim this is an exquisite tragicomystery, thank you for sharing it, I’m in awe

dannotdaniel,
@dannotdaniel@mastodon.social avatar

@bob_zim @shortridge

nice work

thorough! I would never have tracked all that down.

rfc6919,
@rfc6919@aus.social avatar

@bob_zim @shortridge @xssfox

reminds me of a new-deployment airgapped site where all the clients had their clocks wrong by ~11h40s and the customer (who provided the network our gear was on) was not happy. ok, so clients ntp from our server, and our server also has bad time. our server ntps from the site ntp server, which is also wrong. customer claims this is impossible. 11h40s is a really weird error. 11h is the TZ offset, but that shouldn’t turn into an end-user visible local time error. and there’s the 40s on top.

had enough access to the part of the machine room to see the appliance ntp server, which (being airgapped) pulled time from GPS. but, there were a bunch of SMAs on the back with nothing plugged into ‘em. networking vendor installed the NTP server and forgot to attach the antennas, but set the (utc) time from his watch, hence the almost but not quite exactly TZ-sized error.

hazelweakly,
@hazelweakly@hachyderm.io avatar

@shortridge Actually, I take it back, this is my favorite bug story of all time

https://www.teamten.com/lawrence/writings/coding-machines/

The writing is incredible, the twist is beyond absurd. The full implications of it are profound and potentially disturbing

It's not a read for the lighthearted, it'll take a while, but it's absolutely worth it

Imagine "reflections on trusting trust" but rendered real and haunting

TheIronFox,
@TheIronFox@computerfairi.es avatar

@hazelweakly oh what the fuck lol

hazelweakly,
@hazelweakly@hachyderm.io avatar

@TheIronFox right??? So good

uberduck,
@uberduck@hachyderm.io avatar

@hazelweakly @shortridge I got way too far in this before the possibility that it was fiction occurred to me.

It's fiction. Right?

Right?

wolf480pl,
@wolf480pl@mstdn.io avatar

@uberduck @hazelweakly @shortridge
It's fiction. The timing with the mailman is too convenient from a narrative perspective.

uberduck,
@uberduck@hachyderm.io avatar

@wolf480pl @hazelweakly @shortridge Okay, but you do get how hanging the fact/fiction decision on that instead of any of the technical details doesn't make me feel any better, right?

wolf480pl,
@wolf480pl@mstdn.io avatar

@uberduck @hazelweakly @shortridge
For me, the OG "reflections on trusting trust" is much more concerning than this. But I guess that's hardly a consolation.

hazelweakly,
@hazelweakly@hachyderm.io avatar

@uberduck @shortridge The level of detail that it goes into is very very real. Honestly, it's probably fiction, but the fact of it being fictional feels more a mere consequence of probability than impossibility

Which... Is unsettling :)

Kensan,
@Kensan@mastodon.social avatar

@shortridge Have you heard of the “OpenOffice.org won’t print to Brother printers on Tuesdays (but works on other days of the week)” bug?

http://catless.ncl.ac.uk/Risks/25/77#subj14.1

https://mdzlog.alcor.net/2009/08/15/bohrbugs-openoffice-org-wont-print-on-tuesdays/

Ubuntu bug:
https://bugs.launchpad.net/ubuntu/+source/file/+bug/248619

KevinMarks,
@KevinMarks@xoxo.zone avatar

@Kensan @shortridge this reminds me of the "python only parses dates correctly after the 12th of the month" problem I had. (The dates in the files I was being sent had been changed to UK dd/mm/YYYY format. Python assumes mm/dd/YYYY unless the mm>12)

hugovk,
@hugovk@mastodon.social avatar

@KevinMarks @Kensan @shortridge Which bit of Python assumes mm/dd/YYYY unless mm>12?

KevinMarks,
@KevinMarks@xoxo.zone avatar
hugovk,
@hugovk@mastodon.social avatar

@KevinMarks @Kensan @shortridge That's quite the gotcha!

Well, if you're not using ISO dates, and don't tell it what format is being used, this library has to make some sort of guess between mm/dd/YYYY and dd/mm/YYYY.

And iirc you have to tell the standard library which date format to parse, it won't guess.
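A small sketch of why the bug only shows up on some days, using only the standard library's `datetime.strptime` with explicit formats. The helper name and sample strings are made up for illustration.

```python
from datetime import datetime


def parse_both(text):
    """Parse a slash-separated date under both conventions.

    Returns a dict mapping each convention to a date, or None if the
    string is not valid under that convention.
    """
    results = {}
    for label, fmt in (("mm/dd", "%m/%d/%Y"), ("dd/mm", "%d/%m/%Y")):
        try:
            results[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            results[label] = None
    return results


ambiguous = parse_both("03/04/2024")    # valid both ways, two different dates
unambiguous = parse_both("13/04/2024")  # only valid as dd/mm
```

Any day up to the 12th parses cleanly under both formats and silently gives two different dates; only from the 13th onward does the mm/dd reading fail, which is the point where a guessing parser can tell the formats apart.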

rjohnston,
@rjohnston@mstdn.ca avatar

@shortridge I once overheard a co-worker's support call. A client said she came in on Monday, plugged her keyboard back in, and now it didn't work.
My co-worker started troubleshooting wires and connections and such, but nothing worked. Then he circled back and asked, "Wait, you said you plugged your keyboard back in? Why was it unplugged in the first place?"
The client responded that the keyboard was grimy, so she took it home and put it in the dishwasher. Then asked, "Do you think that's why?"

shortridge,
@shortridge@hachyderm.io avatar

@rjohnston wait so when they say “flush the cash” they don’t mean sticking it in the dish washer on heavy rinse???

mattp,
@mattp@mathstodon.xyz avatar

@shortridge I recently encountered a whole collection of such stories: https://beza1e1.tuxen.de/lore/index.html

ieure,
@ieure@retro.social avatar

@shortridge Many more on this list: https://suricrasia.online/iceberg/

In particular, I can recommend:

  • Mario Wolczko's unix recovery
  • OpenOffice does not print on Tuesdays
  • chucknorris is an HTML color
  • Crash Bandicoot quantum corruption

(Even though this site is only ~3 years old, many links are dead, so use web.archive.org when possible)

dpnash,
@dpnash@c.im avatar

@shortridge

Once, I was working on a somewhat old Web app that processed visitor contact forms and we noticed there were some contacts that were getting dropped for no clear reason.

The one thing they had in common: the contacts were all coming from Norway.

A hazily-remembered bit of trivia wiggled its way into a synapse.

The app was using a programming language that, to honor something like 20 years of backward compatibility, still retained one of the most cursed decisions ever made, from sometime in the 90s: aliasing the strings “YES” and “NO” to Boolean true and false respectively.

bob_zim,

@dpnash @shortridge YAML explicitly considers “NO” to be false. It also apparently treats strings with colons in them (you know, like MAC addresses) as base 60 integers and tries to interpret them as times, which is ridiculously incorrect on many levels.

It’s such a terrible file format.
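Here is a simplified sketch of the YAML 1.1 implicit typing rules being described, written as plain Python rather than a call into any real YAML library. It covers only the boolean word list and base 60 integers, ignores the rest of the spec (underscores in numbers, floats, nulls), and the function name is made up.

```python
import re

# Boolean spellings from the YAML 1.1 bool type.
YAML11_TRUE = {"y", "Y", "yes", "Yes", "YES", "true", "True", "TRUE", "on", "On", "ON"}
YAML11_FALSE = {"n", "N", "no", "No", "NO", "false", "False", "FALSE", "off", "Off", "OFF"}

# Simplified base 60 integer pattern: digit groups separated by colons.
SEXAGESIMAL = re.compile(r"[-+]?[1-9][0-9]*(:[0-5]?[0-9])+")


def resolve_plain_scalar(text):
    """Guess the type of an unquoted scalar the way YAML 1.1 roughly would."""
    if text in YAML11_TRUE:
        return True
    if text in YAML11_FALSE:
        return False
    if SEXAGESIMAL.fullmatch(text):
        sign = -1 if text.startswith("-") else 1
        total = 0
        for part in text.lstrip("+-").split(":"):
            total = total * 60 + int(part)
        return sign * total
    return text


country = resolve_plain_scalar("NO")          # Norway's country code
duration = resolve_plain_scalar("12:34:56")   # read as a base 60 integer
word = resolve_plain_scalar("Norway")         # stays a string
```

In this sketch only all-digit colon-separated values hit the base 60 rule; quoting the value in the YAML file sidesteps both surprises.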

Paxxi,
@Paxxi@hachyderm.io avatar

@bob_zim @dpnash @shortridge yaml was fixed in a later version afaik

joelanman,
@joelanman@hachyderm.io avatar
shortridge,
@shortridge@hachyderm.io avatar

@joelanman love it, v worthy addition to the genre

lanodan,
@lanodan@queer.hacktivis.me avatar

@shortridge One story in this style I really like is https://patrickthomson.tumblr.com/post/2499755681/the-best-debugging-story-ive-ever-heard which I'd dub floor tiles vs. mainframe.

lanodan,
@lanodan@queer.hacktivis.me avatar

@shortridge And I think you'd also enjoy https://www.ecb.torontomu.ca/~elf/hack/recovery.html even though the problem is known right from the start, how they pieced it together for a recovery is just glorious.

Di4na,
@Di4na@hachyderm.io avatar

@shortridge i loved this one. Rogue basic regex dos the whole system
https://stackstatus.tumblr.com/post/147710624694/outage-postmortem-july-20-2016

And ofc the good old "not monotonic timestamp" Go one, that they were warned about but dismissed until "proof of impact in real systems". Well. I mean i suppose this counts.

They technically "fixed" it later. But the fix is ... Let's say elegance is in the eye of the watcher
https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns

resuna,
@resuna@ohai.social avatar

@shortridge I remember that one, by the last paragraph where he used the word "millilightseconds", at that point the vague "this seems familiar" feeling collapsed into "yes I've read this before". Something something quantum something.

hazelweakly,
@hazelweakly@hachyderm.io avatar

@shortridge the story of mel remains one of my favorites among this genre of tales of technology

http://www.catb.org/jargon/html/story-of-mel.html

whack,
@whack@hachyderm.io avatar

@shortridge doing some log analysis on web logs found that some servers were recording negative values for “request duration” … turns out certain servers had broken clocks and were causing NTP to kind of jump them backwards in order to synchronize. A request would come in, time jumps backwards, and the request appears to finish before it began. 😂
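The usual way to avoid negative durations is to time requests with a monotonic clock rather than the wall clock. A minimal sketch, with a made-up helper name:

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds).

    time.monotonic() cannot go backwards and is not affected by system
    clock updates, so the elapsed value is never negative even if NTP
    steps the wall clock in the middle of the call.
    """
    start = time.monotonic()
    result = func(*args, **kwargs)
    elapsed = time.monotonic() - start
    return result, elapsed


total, seconds = timed(sum, [1, 2, 3])
```

The wall clock (`time.time()`) is still what you want for the log line's timestamp; it is only the subtraction that should use the monotonic clock.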

MichaelTBacon,
@MichaelTBacon@social.coop avatar

@shortridge 20 years ago I was running the email system at a Major Private University and one week the disks on the inbound email servers started mysteriously filling up repeatedly.

Turns out we had a 50 MB limit on email size, but the software at the time enforced that AFTER the message body was sent. So it would dutifully spool all the data onto disk before saying, "Sorry, rejected" at the end, at which point it would delete it.

1/
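A sketch of the alternative, enforcing the limit while the data is still arriving instead of after it has all been spooled. The function, the chunk iterator, and the in-memory spool are hypothetical stand-ins, not how any particular mail server works; SMTP also has a SIZE extension (RFC 1870) that lets a client declare the size up front so the server can refuse before any body data is sent.

```python
def spool_with_limit(chunks, limit_bytes):
    """Accept message data chunk by chunk, giving up once the limit is passed.

    Returns (body, bytes_seen). body is None when the message was rejected,
    in which case no further chunks are read from the sender.
    """
    received = 0
    spooled = []
    for chunk in chunks:
        received += len(chunk)
        if received > limit_bytes:
            # Stop here: nothing past the limit is read or stored.
            return None, received
        spooled.append(chunk)
    return b"".join(spooled), received


small_body, small_seen = spool_with_limit(iter([b"ab", b"cd"]), limit_bytes=10)
```

With a check like this, a 2GB attachment costs the server roughly the size of the limit in spool space rather than the size of the attachment.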

MichaelTBacon,
@MichaelTBacon@social.coop avatar

@shortridge

Why was this a problem, you might ask?

Someone had decided to email either themselves or someone else a message and attached a DVD rip of the movie Shrek, which was over 2GB in size. And they tried this multiple times.

So all the servers were busily spooling multiple 2GB messages to disk before rejecting (and deleting) them, and in that time the disks would fill up.

From there on out we'd refer to ridiculously oversized email traffic as "some shrek backing everything up."

2/2

shortridge,
@shortridge@hachyderm.io avatar

@MichaelTBacon I lowkey love the persistence of the person sending the pirated Shrek video, incredible story all around, thank you

bittner,
@bittner@hachyderm.io avatar

@shortridge Mess with the laws of physics at your own peril.

jschauma,
@jschauma@mstdn.social avatar
jschauma,
@jschauma@mstdn.social avatar

@shortridge (From my own, very small collection of interesting bugs: https://www.netmeister.org/blog/interesting-bugs.html )

davidseidl,
@davidseidl@mstdn.social avatar

@shortridge Years ago a friend walked into my office in a panic. A hacker had taken over her machine and kept inputting random words into whatever document or file was open.

The hacker would go on rants about the UN and other topics, just filling up her documents. It persisted across reboots, but no malware could be found on the machine.

Security guy to the rescue! We took a look at the system and eventually found the issue: MS Word's speech to text was on, with the gain turned way up!

davidseidl,
@davidseidl@mstdn.social avatar

@shortridge It turned out her laptop was essentially hallucinating text out of the noise the mic was creating. It was remarkably consistent about some words and topics though!

sabik,
@sabik@rants.au avatar

@davidseidl @shortridge
Was it receiving (AM) radio?

davidseidl,
@davidseidl@mstdn.social avatar

@sabik nope! Just the microphone noise causing audio hallucinations in the speech to text tool!

becomingwisest,
@becomingwisest@hachyderm.io avatar