bob_zim,

@shortridge Some time later, I was no longer working tech support. I got hired to do network and firewall stuff for a fairly large company. At one point, they decided to relocate the office where a lot of the operations and monitoring staff worked. They moved the whole application monitoring team to the new building with the unproven infrastructure first, because some people in charge made very bad decisions.

The monitoring team gets to the new building, and they can’t access any of their monitoring systems. Clearly a problem with the new office, right? They go through a few environments to get to their monitoring systems, so I log in to the remote access VPN for the first one and confirm the first firewall they hit sees their traffic and isn’t dropping it.

I go to log in to the remote access VPN for the second environment, where the monitoring systems actually live. I’m able to start the connection, but it never prompts me for my credentials, and the tunnel never comes up. Huh. That’s weird.

Well, I’ll just get in through the DR version of the second environment. Connection works and it prompts me for my credentials, but it rejects them. I try again, in case I made a mistake entering the passphrase for my key, but it’s still rejected. Huh. That’s weird.

I eventually find a working way in. I’m able to ping all the relevant systems, I’m able to make TCP connections via telnet, but trying to actually use a service like SSH or MSRDP just hangs. But wait! I can connect to my firewalls via SSH! So what’s common among the broken systems?

All the broken systems are VMs. I start testing connections to other things which I know are VMs. They all behave the same. Ping works, TCP connections work, but data over the connections gets no response.
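
For anyone who wants to do this kind of triage with a script instead of by hand, here's a rough sketch of the check I was effectively running (hostnames are made up; the point is just to separate "the TCP handshake completes" from "the service actually answers"):

    import socket

    # Hypothetical host list; in the real incident these were the monitoring
    # systems and other suspected-broken VMs.
    HOSTS = ["monitor01.example.internal", "monitor02.example.internal"]

    def probe_ssh(host, port=22, timeout=5):
        """Classify a host: 'no-connect', 'handshake-only', or 'responsive'.

        An SSH server sends its version banner right after the TCP handshake,
        so a connection that opens but then stays silent matches the symptom
        here: the kernel is answering, the service behind it is not.
        """
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.settimeout(timeout)
                try:
                    banner = s.recv(64)
                except socket.timeout:
                    return "handshake-only"
                return "responsive" if banner else "handshake-only"
        except OSError:
            return "no-connect"

    if __name__ == "__main__":
        for host in HOSTS:
            print(f"{host}: {probe_ssh(host)}")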

I bring in the virtualization team. Some of us drive in to the datacenter hosting the VMs giving us trouble. Someone quickly realizes the single SAN hosting all of the VMs’ drives was up, but wasn’t responding to storage requests. Effectively the drive had been pulled out of every single VM. Now we have an explanation for why all the VMs seem to be broken.

With most operating systems, the network stack is wired in RAM and can’t be swapped out. The network stack handles responding to pings and completing TCP handshakes on listening ports. Once a connection is open, though, the listening service has to take over, and that generally means paging its code and data in from storage. With storage no longer responding, the service hangs forever waiting on I/O, so the connection opens but no data ever comes back.
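
You can see the first half of that on its own with a toy experiment (just an illustration, nothing from the actual incident): a socket that listens but never services connections still gets its TCP handshakes completed by the kernel, so a client "connects" successfully and then hears nothing, which is exactly the symptom above.

    import socket

    # "Server": listen, but never accept() and never read or write anything.
    # The kernel still completes the three-way handshake and queues the
    # connection, so from the client's point of view the port is open.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(5)
    port = server.getsockname()[1]

    # Client: the connect succeeds...
    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    print(f"connected to 127.0.0.1:{port}")

    # ...but waiting for any application data goes nowhere, because nothing
    # above the network stack is handling the connection.
    client.settimeout(2)
    try:
        client.recv(1)
        print("got data (not expected in this demo)")
    except socket.timeout:
        print("handshake completed, but no data ever arrived")
    finally:
        client.close()
        server.close()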

Why couldn’t I connect to the second VPN endpoint? Well, some people in charge made very bad decisions. They had decided that since VMs are the future, the VPN endpoints in that facility should be moved from dedicated hardware to VMs stored on the SAN. They hadn’t gotten to the first VPN endpoint yet, but that environment wasn’t allowed to connect in to the second environment.

Okay, but I could connect to the other site’s VPN endpoint, and the other site didn’t have any problems. Why didn’t it accept my credentials? Well, some people in charge made very bad decisions (you may be noticing a theme!). All authentication was run through some VMs which were stored on the SAN. The VPN boxes in the working location were set to monitor the health of the authentication boxes in the failed location by pinging them. As long as they responded to ping, they were good, so the VPN boxes wouldn’t fail over to using their local authentication boxes. And a computer with its drive pulled can still respond to ping with just the network stack in RAM.
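
This is the part a deeper health check would have caught. Here's a sketch of the failover decision the VPN boxes should have been making, with the probe left as a stub (hostnames are invented, and the stub needs to be filled in with a real end-to-end authentication exchange against whatever protocol is actually in use: RADIUS, LDAP, TACACS+, etc.):

    import socket

    REMOTE_AUTH = ("auth.failed-site.example.internal", 636)  # hypothetical
    LOCAL_AUTH = ("auth.local-site.example.internal", 636)    # hypothetical

    def auth_is_healthy(addr, timeout=5):
        """Stub for an end-to-end check.

        The real version should complete an actual authentication exchange
        and demand an explicit accept or reject before the deadline. Ping, or
        even a bare TCP connect like the one below, can keep succeeding while
        the server's storage is gone and it can't authenticate anyone at all.
        """
        try:
            with socket.create_connection(addr, timeout=timeout) as s:
                s.settimeout(timeout)
                # ...real protocol exchange goes here...
                return True
        except OSError:
            return False

    def pick_auth_server():
        # Fail over to the local authentication boxes as soon as the remote
        # ones stop giving real answers, instead of trusting ICMP echo.
        return REMOTE_AUTH if auth_is_healthy(REMOTE_AUTH) else LOCAL_AUTH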

Once we realized what was going on, we physically connected to the WAN routers and added routes to prevent the two sites from reaching each other’s authentication boxes. Presto! We could now log in via the DR environment as normal. The other infrastructure teams were then able to start digging into their parts.
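
The change itself went onto the WAN routers, but the trick translates anywhere: install a route that discards traffic toward the other site's authentication prefix, so the ping-based health check finally fails and the local boxes take over. A Linux-flavored sketch of the same idea (the prefix is made up, and the real change was vendor router config, not this):

    import subprocess

    # Hypothetical prefix covering the failed site's authentication servers.
    REMOTE_AUTH_PREFIX = "198.51.100.0/24"

    # Install a blackhole route so traffic toward the remote authentication
    # boxes is dropped locally. Their ping-based health check now fails, and
    # the VPN boxes fall back to their local authentication servers.
    subprocess.run(["ip", "route", "add", "blackhole", REMOTE_AUTH_PREFIX], check=True)

    # Undo it once the remote site is healthy again:
    #   ip route del blackhole 198.51.100.0/24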

But why is the SAN unresponsive? Turns out this particular SAN vendor had an option for what to do under certain failure conditions: it could fail read-only or fail completely silent. This one was set to fail silent, and it had filled up.

I wasn’t directly involved in fixing the SAN. I know the manager over the SAN team had been sounding the alarm for months before it filled. I also know there were multiple levels of bad configuration, such as more space offered by LUNs than the SAN could physically provide.
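
That last point is the classic thin-provisioning trap: the LUNs advertise more space than the array physically has, and nothing hurts until the pool actually fills. The arithmetic is trivial to check if anyone bothers to (numbers below are invented for illustration):

    # Invented figures: physical capacity of the pool, and the advertised
    # (thin-provisioned) size of each LUN carved out of it.
    physical_tb = 100
    lun_sizes_tb = [20, 20, 30, 30, 40]      # 140 TB promised

    promised_tb = sum(lun_sizes_tb)
    overcommit = promised_tb / physical_tb   # 1.4x in this example

    print(f"promised {promised_tb} TB against {physical_tb} TB physical "
          f"({overcommit:.1f}x overcommit)")

    # Anything over 1.0x means the guests can collectively write more than
    # the pool can hold, and the array has to fail *somehow* when they do;
    # this one was configured to fail silent.
    if overcommit > 1.0:
        print("WARNING: pool is overcommitted; watch real utilization closely")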

Big takeaways:

  1. Make sure your access to fix a system doesn’t depend on that system. It’s really easy to accidentally introduce dependency cycles, and it takes constant work to avoid them; there’s a small sketch of auditing for exactly this after the list.
  2. Superficial tests like whether you can ping something can’t detect some pretty major failures. Tests that exercise the actual service, like completing a real authentication, are much more likely to notice the problem.
  3. When something is critical to an environment, maybe have more than one of them? The SAN had internal redundancy to deal with faulty drives and so on, but all the storage was in one giant pool. Multiple SAN systems can provide a bulkhead such that breaking one would not break all VMs.
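
On the first takeaway: the cycle here was VPN access -> authentication VMs -> SAN -> the thing we needed the VPN access to go fix. Once you write the dependencies down, checking for that is mechanical. A toy sketch, with a dependency map invented to mirror this incident:

    # Toy dependency map mirroring this incident (all names invented):
    # "what does each thing need in order to work?"
    DEPENDS_ON = {
        "vpn-env2": ["auth-vms"],      # the VPN endpoint authenticates via the auth VMs
        "vpn-env2-dr": ["auth-vms"],   # so does its DR twin
        "auth-vms": ["san-env2"],      # the auth VMs' disks live on the SAN
        "san-env2": [],
        "firewalls-env2": [],          # reachable without the SAN, which is what saved us
    }

    # "which path do we plan to use to go fix each thing?"
    ACCESS_PATH = {
        "san-env2": "vpn-env2",
        "auth-vms": "vpn-env2",
    }

    def depends_on(graph, start, target):
        """True if `start` transitively depends on `target`."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, []))
        return False

    for system, access in ACCESS_PATH.items():
        if depends_on(DEPENDS_ON, access, system):
            print(f"cycle: fixing {system} goes through {access}, "
                  f"which itself depends on {system}")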