@norootcause@hachyderm.io

norootcause

Student of complex systems failures, resilience engineering, cognitive systems engineering. Will talk your ear off about learning from incidents in software.

norootcause, to random

“I wish software systems crashed at the same rate as bridges do.”

monkey’s paw finger curls

hazelweakly, to random

Pretty much all of the woes of distributed tracing are caused from the mismatch of the mental model of distributed tracing that makes sense vs the one that can be built easily:

The model that makes sense is "lazily built and incrementally fleshed out call graph with late-binding updates of attributes as discovered"

But the way that makes sense to implement it is "strict call-stack semantics with fire-and-forget frozen rows of data into an append only data store"
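
A minimal sketch of the contrast between the two models, under loud assumptions: every class and method name here (`FrozenSpan`, `AppendOnlyStore`, `LazyCallGraph`, `observe_call`, `name_node`, `annotate`) is hypothetical and invented for illustration, and none of it is taken from a real tracing library.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class FrozenSpan:
    """Easy-to-build model: an immutable row, written once when the span ends."""
    span_id: str
    parent_id: Optional[str]
    name: str
    attributes: tuple[tuple[str, str], ...]


class AppendOnlyStore:
    """Fire-and-forget sink: rows can be added, never updated."""

    def __init__(self) -> None:
        self._rows: list[FrozenSpan] = []

    def append(self, row: FrozenSpan) -> None:
        self._rows.append(row)

    def rows(self) -> tuple[FrozenSpan, ...]:
        return tuple(self._rows)


@dataclass
class GraphNode:
    name: str = "<unknown>"
    attributes: dict[str, str] = field(default_factory=dict)
    children: set[str] = field(default_factory=set)


class LazyCallGraph:
    """Model that matches the mental picture: nodes appear on first mention
    and are filled in as more is learned about them."""

    def __init__(self) -> None:
        self.nodes: dict[str, GraphNode] = {}

    def _node(self, node_id: str) -> GraphNode:
        # Create a placeholder the first time an id is mentioned.
        return self.nodes.setdefault(node_id, GraphNode())

    def observe_call(self, parent_id: str, child_id: str, child_name: str) -> None:
        # The parent may not have been described yet; that is fine.
        self._node(parent_id).children.add(child_id)
        self._node(child_id).name = child_name

    def name_node(self, node_id: str, name: str) -> None:
        self._node(node_id).name = name

    def annotate(self, node_id: str, **attrs: str) -> None:
        # Late-binding update: attributes can arrive at any time.
        self._node(node_id).attributes.update(attrs)


# Frozen rows: once written, what we learn later cannot be attached to "s1".
store = AppendOnlyStore()
store.append(FrozenSpan("s1", None, "checkout", (("user", "42"),)))

# Lazy graph: the child is seen first, the parent is described afterwards.
graph = LazyCallGraph()
graph.observe_call("s1", "s2", "charge-card")
graph.name_node("s1", "checkout")
graph.annotate("s1", user="42")
graph.annotate("s1", retried="true")  # discovered after the fact
```

In the frozen-row version, a fact discovered after the span is written has to become a new row that something downstream must join back together; in the graph version it is just another update to an existing node.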

norootcause,

@hazelweakly I think your observation that:

“Pretty much all of the woes of X are caused from the mismatch of the mental model of X that makes sense vs the one that can be built easily.”

holds for multiple values of X

norootcause, to random

Hofstadter's law, incident edition: incidents are always more complex than you think, even taking into account Hofstadter’s law, incident edition.

norootcause, to random

Single point of flail

norootcause, to random

Do any databases out there offer first-class support for soft deletes?

norootcause, to random

Prediction: LLM-based systems are going to exhibit some really weird failure modes.

norootcause,

@kellogh Ain’t seen nothing yet!

norootcause, to random

Hey-you-should-take-a-look-at-this-as-a-service

mononcqc, to random

A few days ago, @hazelweakly wrote about her redefinition of observability on her blog.

I decided to add some extra color commentary to it on mine, in an attempt to provide extra context and framing around her ideas, but also around classical definitions of observability.

The post covers differences between insights and questions, distinctions between observability and data availability, socio-technical implications, mapping complex systems, and on the use of models.

https://ferd.ca/a-commentary-on-defining-observability.html

norootcause,

@mononcqc @hazelweakly So great to see people in dialog via blogs again!

norootcause, to random

Good luck dealing with systems that encounter sensor errors!

norootcause,

@c0dec0dec0de @jvilk I believe the sensor folks refer to systemic error as “bias”.

norootcause, to random

Endlessly fascinated by goal conflicts and double binds.

"Workers coped with the double bind by developing a 'covert work system' that involved, as one worker put it, 'doing what the boss wanted, not what he said'." – Woods et al., Behind Human Error

norootcause, to random

Woods et al. talk about "buggy" knowledge, mental models that are incorrect in an important way. Alas, there are no unit tests for mental models.

norootcause, to random

My absolutely hottest take: only the first word in a title or heading should be capitalized.

norootcause,

@onelson In that case, it's a proper name.

norootcause, to random

"In general, outsiders pay attention to practitioners' coping strategies only after failure, when such processes seem awkward, flawed, and fallible. It is easy for post-incident evaluations to say that a human error occurred." – Woods et al., Behind Human Error

norootcause, to random

I'm kinda fascinated by LLMs because, in one sense, we know exactly how they work, and in another sense, we have no idea how they work.

norootcause, to random

Reading this @nat newsletter, and it reminded me that one of the things that bummed me out the most when I was at Netflix was when they introduced levels: https://www.simplermachines.com/bookmarked-links/?ref=simpler-machines-newsletter

norootcause,

@gdinwiddie @nat I’ve also heard something along the lines of “do you want to be right, or do you want to make a difference?”

norootcause, to random

Proposed org metric: reflection ratio:
(time spent on reflection) / (time spent on planning)

norootcause,

@trochee Yep, my experience is the same. Reflection time is always the easiest thing to drop, so it’s the first thing that goes when the org gets pressed for time.

norootcause, to random

Resilience is about treating surprise as a first-class thing.

norootcause, to random

Doing some Toyota-related reading and thread-following-Googling brought me to this page: https://www.lean.org/lexicon-terms/gemba/

In the table on that page, one of these things is not like the others.

norootcause,

Code is an artifact generated by the work, it’s not where the work gets done.
