Camp Outage [Aug 17]

We had an outage from 5:35 PM to 4:29 AM (all times Pacific).

The outage eased on its own, but service stayed quite slow afterwards. A proper fix was applied around 8 AM, and shortly after 9 AM a configuration update was pushed to prevent the scaling issue from recurring.

Timeline:

  • 5:35pm: Outage starts; service is completely unavailable.
  • 4:29am: Traffic drops enough to allow some automatic recovery.
  • 7:50am: Investigation starts.
  • 8:04am: Took a quick look and found one of the nodes in the cluster in a bad state. The orchestrator didn't catch it and kept it in rotation. I took the node down and got it back in shape, so hopefully y'all see an improvement now.
  • 8:52am: Just found a big issue with the Docker setup for /kbin. The Docker guide isn't really tested, so it doesn't cover everything. But I found the bare-metal guide (for running directly on a server), which has an important step for letting /kbin handle a lot more processes. Implementing this right now, so that should make a huge difference.
  • 9:21am: Alrighty, pushed some config changes that should improve Artemis.camp performance. Will monitor latency over the next day to measure improvements.
  • 11:04am: Things seem to be more stable. Will take another look at the end of the day, though 🙂
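
For anyone wanting to double-check the durations implied by the timeline, here is a minimal Python sketch. It assumes the 4:29am and 8:04am entries fall on the morning after the 5:35pm start; the helper name `outage_duration` is made up for illustration.

```python
from datetime import datetime, timedelta

def outage_duration(start: str, end: str) -> timedelta:
    """Time between two clock times such as '5:35PM' and '4:29AM'.

    Assumption: if the end is not later than the start on the same day,
    the end is taken to be on the following day.
    """
    fmt = "%I:%M%p"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 <= t0:
        t1 += timedelta(days=1)  # roll past midnight
    return t1 - t0

# Start of the outage until automatic recovery kicked in.
full_outage = outage_duration("5:35PM", "4:29AM")
# Start of the outage until the bad node was fixed.
until_fix = outage_duration("5:35PM", "8:04AM")

print(full_outage)  # 10:54:00
print(until_fix)    # 14:29:00
```

So the service was fully down for just under 11 hours, and roughly 14.5 hours passed before the bad node was repaired.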
SuperSpaceFan,

Thank you for this post, and your transparency. Keep up the great work!
