norootcause

@norootcause@hachyderm.io

Student of complex systems failures, resilience engineering, cognitive systems engineering. Will talk your ear off about learning from incidents in software.

norootcause, 2 months ago to random

“I wish software systems crashed at the same rate as bridges do.”

monkey’s paw finger curls

hazelweakly, 2 months ago to random

Pretty much all of the woes of distributed tracing are caused by the mismatch between the mental model of distributed tracing that makes sense and the one that can be built easily:

The model that makes sense is "lazily built and incrementally fleshed out call graph with late-binding updates of attributes as discovered"

But the way that makes sense to implement it is "strict call-stack semantics with fire-and-forget frozen rows of data into an append only data store"

norootcause, 2 months ago

@hazelweakly I think your observation that:

“Pretty much all of the woes of X are caused by the mismatch between the mental model of X that makes sense and the one that can be built easily.”

holds for multiple values of X

norootcause, 2 months ago to random

Hofstadter's law, incident edition: incidents are always more complex than you think, even taking into account Hofstadter’s law, incident edition.

norootcause, 2 months ago to random

Single point of flail

norootcause, 2 months ago to random

Do any databases out there offer first-class support for soft deletes?

norootcause, 2 months ago to random

Prediction: LLM-based systems are going to exhibit some really weird failure modes.

norootcause, 2 months ago

@kellogh Ain’t seen nothing yet!

norootcause, 2 months ago to random

Hey-you-should-take-a-look-at-this-as-a-service

mononcqc, 2 months ago to random

A few days ago, @hazelweakly wrote about her redefinition of observability on her blog.

I decided to add some extra color commentary to it on mine, in an attempt to provide extra context and framing around her ideas, but also around classical definitions of observability.

The post covers differences between insights and questions, distinctions between observability and data availability, socio-technical implications, mapping complex systems, and the use of models.

https://ferd.ca/a-commentary-on-defining-observability.html

norootcause, 2 months ago

@mononcqc @hazelweakly So great to see people in dialog via blogs again!

norootcause, 2 months ago to random

Good luck dealing with systems that encounter sensor errors!

norootcause, 2 months ago

@c0dec0dec0de @jvilk I believe the sensor folks refer to systemic error as “bias”.

norootcause, 2 months ago to random

Endlessly fascinated by goal conflicts and double binds.

"Workers coped with the double bind by developing a 'covert work system' that involved, as one worker put it, 'doing what the boss wanted, not what he said'." – Woods et al., Behind Human Error

norootcause, 2 months ago to random

Woods et al. talk about "buggy" knowledge, mental models that are incorrect in an important way. Alas, there are no unit tests for mental models.

norootcause, 2 months ago to random

My absolutely hottest take: only the first word in a title or heading should be capitalized.

norootcause, 2 months ago

@onelson In that case, it's a proper name.

norootcause, 2 months ago to random

"In general, outsiders pay attention to practitioners' coping strategies only after failure, when such processes seem awkward, flawed, and fallible. It is easy for post-incident evaluations to say that a human error occurred." – Woods et al., Behind Human Error

norootcause, 2 months ago to random

I'm kinda fascinated by LLMs because, in one sense, we know exactly how they work, and in another sense, we have no idea how they work.

norootcause, 2 months ago to random

Reading this @nat newsletter, and it reminded me that one of the things that bummed me out the most when I was at Netflix was when they introduced levels: https://www.simplermachines.com/bookmarked-links/?ref=simpler-machines-newsletter

norootcause, 2 months ago

@gdinwiddie @nat I’ve also heard something along the lines of “do you want to be right, or do you want to make a difference?”

norootcause, 2 months ago to random

Proposed org metric: reflection ratio:
(time spent on reflection) / (time spent on planning)

norootcause, 2 months ago

@trochee Yep, my experience is the same. Reflection time is always the easiest thing to drop, so it’s the first thing that goes when the org gets pressed for time.

norootcause, 2 months ago to random

Resilience is about treating surprise as a first-class thing.

norootcause, 2 months ago to random

New blog post: https://surfingcomplexity.blog/2024/03/16/when-theres-no-gemba-to-go-to/

norootcause, 2 months ago to random

Doing some Toyota-related reading and thread-following-Googling brought me to this page: https://www.lean.org/lexicon-terms/gemba/

In the table on that page, one of these things is not like the others.

norootcause, 2 months ago

Code is an artifact generated by the work, it’s not where the work gets done.
