The Deployment Tail

One Monday morning I went into work to find one of the systems down, and it reminded me of one of the rules I had developed while working out the 'Ops' side of DevSecOps.

A lot of people gain their Operations experience managing applications, and these tend to be 3-tier architectures (e.g. UI -> Middleware -> Database). This kind of system is typically event-triggered, so changing the system and making a few calls against the UI will quickly test the entire chain of components. This allows you to immediately confirm the change was successful.

As a result, if your team works a standard week (Monday-Friday, 9-5), you probably have a rule against performing changes on a Friday afternoon.

However, as systems become more complex, a typical pattern I have seen is to have services which run at set time intervals and perform some kind of housekeeping on data.

For example, KBin stores everything within Postgres. Over time the number of records will increase and slow down the responsiveness of the database. This problem is typically mitigated by moving data out of the main datastore into a long-term data store (Hadoop is common). This allows the main database to stay a "reasonable" size.

This is often done using a state machine to control the migration process, e.g.:

  • Mark a record as needing to be migrated
  • Create a copy of the record in long term storage
  • Confirm copy is correct within long term storage
  • Flag main data store instance of record as ready for deletion
  • Delete main data store instance of record
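The steps above can be sketched as a small linear state machine. This is only an illustration: the state names, `NEXT_STATE`, `advance` and `steps_remaining` are hypothetical and not taken from any particular product.

```python
from enum import Enum, auto


class MigrationState(Enum):
    """Hypothetical states for one record, mirroring the list above."""
    ACTIVE = auto()                # normal record in the main data store
    MARKED_FOR_MIGRATION = auto()  # marked as needing to be migrated
    COPIED = auto()                # copy created in long-term storage
    VERIFIED = auto()              # copy confirmed correct
    READY_FOR_DELETION = auto()    # main-store instance flagged for deletion
    DELETED = auto()               # main-store instance removed


# Each state has exactly one legal successor; DELETED is terminal.
NEXT_STATE = {
    MigrationState.ACTIVE: MigrationState.MARKED_FOR_MIGRATION,
    MigrationState.MARKED_FOR_MIGRATION: MigrationState.COPIED,
    MigrationState.COPIED: MigrationState.VERIFIED,
    MigrationState.VERIFIED: MigrationState.READY_FOR_DELETION,
    MigrationState.READY_FOR_DELETION: MigrationState.DELETED,
}


def advance(state):
    """Return the next state, or raise if the record is already deleted."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is terminal")
    return NEXT_STATE[state]


def steps_remaining(state):
    """Count how many transitions are left before the record is deleted."""
    count = 0
    while state in NEXT_STATE:
        state = NEXT_STATE[state]
        count += 1
    return count
```

If each transition is driven by a scheduled job, a record only reaches `DELETED` after every one of those jobs has run in turn, which is where the delay comes from.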

Thus your ability to confirm the change was successful is delayed: you need to see data move from the start point in the state machine to the end. This can typically take 6-96 hours, which creates a long tail before you can confirm a change was successful.

This tail dictates when in a week you can perform a change, since the change hasn't completed until the tail has finished.

With a 6-hour tail, a change rolled out at 9am on Thursday doesn't complete until 3pm.

With a 72-hour tail, a change performed at 9am on Tuesday doesn't complete until 9am on Friday.
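The arithmetic in these two examples is simple enough to check in a few lines. The function name `tail_end` and the calendar dates are made up for illustration; the dates are chosen only because 2 January 2024 was a Tuesday and 4 January 2024 was a Thursday.

```python
from datetime import datetime, timedelta


def tail_end(start, tail_hours):
    """Return the moment a change started at `start` can be confirmed."""
    return start + timedelta(hours=tail_hours)


# 6-hour tail, rolled out at 9am on a Thursday
thursday_9am = datetime(2024, 1, 4, 9, 0)
short_tail = tail_end(thursday_9am, 6)    # same Thursday, 15:00

# 72-hour tail, rolled out at 9am on a Tuesday
tuesday_9am = datetime(2024, 1, 2, 9, 0)
long_tail = tail_end(tuesday_9am, 72)     # Friday that week, 09:00
```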

Since your rollback process also has this tail, it seriously impacts when and how you can perform a change. Any part of the system with more than a 48-hour tail either needs 7 days of cover or has reached the point where you can't manage the service.

In either case it should become a priority.
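One way to reason about the 48-hour figure is to work out the latest point in the week at which a change can start. The sketch below is a rough model, not a rule from any tool, and it rests on assumptions that are mine: cover runs from Monday 09:00 to Friday 17:00, the tail runs continuously overnight, and a failed change needs one full tail to detect plus one more for the rollback. The function name `latest_safe_start` is hypothetical.

```python
from datetime import datetime, timedelta


def latest_safe_start(week_monday, tail_hours, allow_rollback=True):
    """Latest start time so the tail ends by Friday 17:00, else None.

    Assumes cover from Monday 09:00 to Friday 17:00 and that a rollback
    costs one extra tail of the same length (both are assumptions).
    """
    earliest = week_monday.replace(hour=9, minute=0, second=0, microsecond=0)
    deadline = earliest + timedelta(days=4, hours=8)  # Friday 17:00
    multiplier = 2 if allow_rollback else 1
    budget = timedelta(hours=tail_hours * multiplier)
    latest = deadline - budget
    # If the budget is longer than the covered week, nothing fits.
    return latest if latest >= earliest else None


monday = datetime(2024, 1, 1)  # 1 January 2024 was a Monday

fits_48 = latest_safe_start(monday, 48)   # Monday 17:00, only just fits
fits_72 = latest_safe_start(monday, 72)   # None, cannot fit with a rollback
no_rollback_72 = latest_safe_start(monday, 72, allow_rollback=False)
```

Under these assumptions a 48-hour tail leaves a single afternoon each week in which a change can safely start, and anything much longer leaves no window at all without weekend cover.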
