thisismissem,
@thisismissem@hachyderm.io avatar

Not sure how I feel about @social's emphasis on AI/ML in the job listings for Mozilla Social — such a heavy emphasis feels counter to the culture of the fediverse, which generally advocates against AI/ML from what I've seen.

Though I'm also aware that practically everything is needing to employ those buzzwords for funding these days.

https://www.mozilla.org/en-US/careers/listings/?team=Mozilla%20Social

cc @than

thisismissem,
@thisismissem@hachyderm.io avatar

As an aside, I don't think I'd apply for either of these roles as:

a) I don't do on-call; hire dedicated SREs and DevOps engineers for that.

b) there's a lack of focus on moderation tooling which is my area of focus.

c) I'm generally not a fan of "AI"

thisismissem,
@thisismissem@hachyderm.io avatar

Though also, it's nice to see them having a Senior Staff Software Engineer role, which is my experience level.

runewake2,
@runewake2@hachyderm.io avatar

@thisismissem I'm curious, have you found many roles that only have SRE/DevOps on call? That feels rare to me.

thisismissem,
@thisismissem@hachyderm.io avatar

@runewake2 like, the other part of this is that on-call needs to be layered, and time needs to be allocated to designing reliable systems. Additionally, all contracts for those on-call must include provisions for compensation and time off for the periods they're on-call.

thisismissem,
@thisismissem@hachyderm.io avatar

@runewake2 so I've worked several jobs where I've not been on-call & usually outages didn't need to involve product engineering in incident response, but more in the retrospective and planning for how to improve systems in the future. Systems should be documented well enough that on-call engineers can diagnose issues and resolve them mostly without escalating.

runewake2,
@runewake2@hachyderm.io avatar

@thisismissem that makes a lot of sense, thanks!

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem engineers who are not prepared to be on-call for a system they wrote are a huge red flag.

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem SRE is first and foremost an engineering function. It's right there in the name.

SRE does not exist to be on-call. SRE exists to improve the reliability of the software.

Sharing the on-call duty is one of the things an SRE team can do to better understand what is causing reliability issues and prioritise work to fix those issues.

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem The product owners may legitimately decide that work to improve reliability is deprioritised compared to other work. If that happens SLOs should be adjusted accordingly (and paging thresholds should be tied to SLOs), and/or the SRE team should disengage.

Here's how to have a healthy on-call rotation: https://nikclayton.writeas.com/rules-for-a-healthy-on-call-rotation
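
One way to make "paging thresholds tied to SLOs" concrete is an error-budget burn-rate check. A minimal Python sketch, with illustrative numbers rather than anything from the post or the linked article:

```python
# Minimal sketch: page only when the error budget is burning fast.
# The SLO target and the 14.4x fast-burn threshold are illustrative defaults.

SLO_TARGET = 0.999               # e.g. 99.9% of requests succeed over the window
ERROR_BUDGET = 1.0 - SLO_TARGET  # fraction of requests allowed to fail

def burn_rate(failed: int, total: int) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (failed / total) / ERROR_BUDGET

def should_page(failed: int, total: int, threshold: float = 14.4) -> bool:
    # Fast burn pages a human; slower burns become tickets, not 3am pages.
    return burn_rate(failed, total) >= threshold

# Example: 200 failures in 10,000 requests over the last hour is a 20x burn.
print(should_page(failed=200, total=10_000))  # True
```

Lowering the SLO target (as suggested above when reliability work is deprioritised) directly raises how many failures it takes before anyone gets paged.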

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton see: https://hachyderm.io/@thisismissem/111194970971544788

Very often I see companies not hiring anyone in an ops capacity, and that work then falls on engineers who do not specialise in that field.

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton Incident response management and on-call require specific training and contracts to cover that work. General engineers almost never have that, so my general rule is "I don't do on-call". Further, it's incredibly rare that a system I've worked on will need escalation to me in the middle of the night; usually it can wait until office hours.

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem generally, no one should be woken up in the middle of the night.

An on-call rotation that requires, as a matter of routine, someone to be paged far outside normal business hours is irredeemably broken, and means the organisation is setting reliability goals it is not staffed to meet.

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton The problem with the terms SRE, DevOps, Ops, or whatever you want to call it, is that the name has changed every few years. But basically the people who specialise in systems management and operations need to exist, and usually they're found more in those functions than in product engineering.

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem "" is definitely worthless as a term. I've got a presentation kicking around somewhere where the thesis is "If you want to create a DevOps team you've already failed".

In a well functioning team there should be no "us vs. them" between the SWE and people. Both groups are engineers, it's just that the engineering focus is different.

The goal is still the same. Build and deliver services that meet the users' needs, within the .

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton Yeah, but I don't know how to fix complex Kubernetes infrastructure, nor networking issues, nor any of the other modern tooling DevOps folks have invented with ever-increasing complexity.

I know the basics of Linux system administration, and setting up, say, nginx + Let's Encrypt. But things like "why isn't PgBouncer working correctly?" and "why didn't the database failover happen successfully?" — I've got no idea; it's not my area of expertise.

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton also, most software just doesn't need a nine 9's SLO or anything like that.

But all this is beside the point: for me, as a product engineer, to be effective and have a regularly scheduled workflow, I can't be on-call, because that usually means interrupted sleep and additional time off in compensation (German law), which means planning around work goes completely out the window.

And, well, at the end of the day I probably can't fix an infrastructure problem.

sgf,
@sgf@mastodon.xyz avatar

@thisismissem @nikclayton

Being on-call really screws with the ability of SREs to do their day job, too.

The advantage of SRE at Google is that on-call is taken seriously, so we usually have 2 teams to avoid overnight on-call, and we schedule things to take into account how on-call kills productivity.

We are also not a substitute for product devs understanding production, because writing operable software requires understanding how to operate the software, and that usually comes through practice.

1/n

sgf,
@sgf@mastodon.xyz avatar

@thisismissem @nikclayton

We try to make on-call handleable by devs because it also needs to be handled by newish SREs, and operable under stressful conditions. Lots of docs, lots of good tooling.

And as Nik says, we expect product devs to contribute to production. Just like we have experts in testing but still expect product devs to write their own tests.

A decent amount of this is the luxury of scale...

2/n

sgf,
@sgf@mastodon.xyz avatar

@thisismissem @nikclayton

Without G scale it's a lot harder to make on-call that doesn't suck.

Product dev going "this is painful, so we have a specialist team to take the pain" is why I'll be moving back to the product dev side if I ever leave big tech.

3/3

thisismissem,
@thisismissem@hachyderm.io avatar

@sgf @nikclayton I'm not arguing product engineers shouldn't know about optimising queries, or about exposing data for monitoring, or have basic knowledge of the tooling.

What I am saying is that a product engineer probably can't fix a network outage, full database, failing node, misfiring firewall, etc.

That is, today's infrastructure requires specialist knowledge and there's no way to know everything, so we've gotta be realistic about who has the skills and knowledge to understand the alerts & respond appropriately.

thisismissem,
@thisismissem@hachyderm.io avatar

@sgf @nikclayton the service that I deploy should be correctly documented & instrumented & produce good logs, so as to enable others to diagnose issues & deploy mitigations, if it's the service that's really degraded and not the infrastructure itself.

As a product engineer, I should be trying to write performant database & network operations. I should also have my service instrumented for error tracking such that I can get insight into what fails.

But I probably don't need to be woken up for it.
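
As an illustration of the kind of instrumentation being described, a product engineer might emit structured, contextual error logs like the Python sketch below; the service name, fields and runbook URL are hypothetical, not from the thread:

```python
# Rough sketch: structured error logging so whoever is on-call can diagnose and
# mitigate without waking the author. All names here are made up for illustration.
import json
import logging
import time

logger = logging.getLogger("profile-service")

def load_profile_from_db(user_id: str) -> dict:
    # Stand-in for a real data-access call.
    raise TimeoutError("database did not respond")

def fetch_profile(user_id: str, request_id: str) -> dict:
    started = time.monotonic()
    try:
        return load_profile_from_db(user_id)
    except Exception:
        # Log enough context to act on: who, what, how long, and where to look next.
        logger.exception(json.dumps({
            "event": "profile_fetch_failed",
            "user_id": user_id,
            "request_id": request_id,
            "duration_ms": round((time.monotonic() - started) * 1000),
            "runbook": "https://wiki.example.internal/runbooks/profile-fetch",
        }))
        raise
```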

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem @sgf on-call far outside normal business hours for the region is a different organisational dysfunction, and indicates the business has set reliability goals they are not staffed to meet.

Individually there's not much you can do about that except vote with your feet, if you're lucky enough to be in the position to do so.

Collectively, well, this is why unions are so important.

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem @sgf I agree, they shouldn't be expected to fix a network outage.

But a product engineer should grow to know enough about the environment the product is running in to mitigate that outage. And knowledge like that helps them design the software to fail gracefully in the first place.

Somewhere on the L3/L4 promotion path, IMHO.

As already noted, none of that happens without a specific level of organisational maturity, with ongoing training and support.

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton @sgf right, so I'm a product engineer who was never given that training & who has had to drag people (almost literally) to retrospectives following major outages that another team caused but my team got blamed for. Can you blame me for not wanting to be on-call, or for seeing "everyone is on-call" as a potential red flag?

What y'all are talking about is incredibly rare to see.

sgf,
@sgf@mastodon.xyz avatar

@thisismissem @nikclayton

At the same time "devs are never on-call" looks like a huge red flag for those who do take the pager.

If your workplace is into assigning blame for outages, that's the red flag right there. Being on-call is just "Am I a target for this toxic culture?".

Being able to spend time and effort on excellent production tooling might be a big tech thing, but handling outages blamelessly costs nothing.

thisismissem,
@thisismissem@hachyderm.io avatar

@sgf @nikclayton whilst I agree, I've definitely experienced outages that were caused by another system changing unexpectedly, with that team not communicating the changes, and somehow we were just expected to know something had changed.

So whilst blameless is generally good, I've also seen it used to deny responsibility for something that was that team's responsibility: communicating changes & ensuring they didn't make breaking changes.

sgf,
@sgf@mastodon.xyz avatar

@thisismissem @nikclayton We do blamelessness not because it's hard to find someone to blame, but because blaming people makes it harder to improve things.

Blame often leads to "Root cause: Human error" rather than end-to-end systemic fixes. It's not clear to me what assigning blame is supposed to achieve in any healthy culture.

1/n

sgf,
@sgf@mastodon.xyz avatar

@thisismissem @nikclayton

In the case you mention, pretty much the point of blamelessness is to decouple blame from the need for change. If better comms would have prevented/ameliorated the outage, that's an action item (AI), and taking on the work is no admission of guilt.

Blamelessness doesn't mean "those to blame don't fix things", it means people who aren't to blame fix things.

2/n

sgf,
@sgf@mastodon.xyz avatar

@thisismissem @nikclayton

In case you think this is all Google idealism, I spent a decade as a dev in places with a strong dev/ops split & a blame culture. I've experienced the alternative.

3/3

thisismissem,
@thisismissem@hachyderm.io avatar

@sgf @nikclayton oh, I should clarify: the team who was responsible for the outage & whose work caused it was held completely blameless, but my team was blamed because apparently we didn't build resiliently enough against the other team's breaking changes.

Yeah... so, I've experienced both, but tbh, when SWEs were expected to be on-call, it was almost always a sign of other unhealthy things happening at the company.

sgf,
@sgf@mastodon.xyz avatar

@thisismissem @nikclayton If the gist is that the team that should have been blamed wasn't, and the one that shouldn't have been was, it's really not a blameless approach at all.

I mean, sure, blamelessness doesn't work if you just use the word without actually doing it, but that's expected!

The blameless approach, if Team A screws up & Team B assumes A won't screw up, is to investigate both how to stop A screwing up and how to stop B needing A not to screw up, as that's how you build defence in depth.

nikclayton, (edited )
@nikclayton@mastodon.social avatar

@thisismissem @sgf "blame" is not the same thing as "assigning responsibility".

A good red flag for this is teams that say "We do blameless postmortems by not naming anyone in the postmortem".

No! You know you have a blameless postmortem culture when you can name people in postmortems without it causing problems.

This can be exceptionally hard to achieve, but it's worth it.

Edit: see also @danslimmon https://blog.danslimmon.com/2023/04/20/its-fine-to-use-names-in-post-mortems/

sgf,
@sgf@mastodon.xyz avatar

@thisismissem @nikclayton Most junior SREs can't fix a big outage on their own, either. And there are plenty of outages that require deep understanding of product code that SREs don't have. So, we focus on engineering the systems to simplify on-call in the common case (e.g. easy rollbacks) and make sure escalation paths are available both within SRE and to product dev.

hrefna,
@hrefna@hachyderm.io avatar

@thisismissem

This underscores the need for people who do ops work, regardless of where they fall in the spectrum. It also emphasizes the need for both broad and deep experience (I hate the term full stack and think that it is a misnomer every time I've seen it applied, but I like the model of "T-shaped developers").

Sometimes you need ops specialists, sometimes that work can be folded under an SRE, sometimes it works best under a product SWE, but you definitely need it.

@sgf @nikclayton

hrefna,
@hrefna@hachyderm.io avatar

@thisismissem

I like the model of pushing ops work as much as possible to the owning team, but with outside specialists that they can consult, and serious investment in the pipeline and architecture to facilitate them doing their jobs cleanly.

For example: I'm one of the very few PSQL specialists in the orbit of projects I've worked on. But that was true when I was a SWE, and I don't use those skills much as an SRE (for obvious reasons).

You need someone to do that.

@sgf @nikclayton

hrefna,
@hrefna@hachyderm.io avatar

@sgf

Minor tangent: Another advantage of "G scale" that I have found classically gets lost in other circumstances is the level of consideration around risk.

For example: I am an SRE on Identity at Google. If something goes down there, that naturally warrants a fast response time and 24/7 on-call.

I've also worked on a project (previous company) that went down for four hours and no one noticed, but the managers wanted us to have 24/7 on-call with a 5-minute SLA, staffed from one TZ.

@thisismissem @nikclayton

hrefna,
@hrefna@hachyderm.io avatar

@sgf

Even at Google I've had plenty of this sort of nonsense, but at least at Google there's usually not much excuse for not staffing appropriately, and there's more of both policy and tooling to make it manageable.

Like, at one point prior to being an SRE, I shut down management being nonsensical about this sort of thing cold, by bringing up the Actual Policies That Needed To Be Followed™ and telling them the level at which escalations would need to happen.

It works wonders.

@thisismissem @nikclayton

thisismissem,
@thisismissem@hachyderm.io avatar

@hrefna @sgf @nikclayton yeah, really my point is that people need training, consideration, and compensation for on-call work — that's rarely, if ever, seen in SWE contracts, and in my experience SLAs/SLOs have usually been arbitrary and not actually relevant to the product.

But also, you've gotta have the appropriate people responding to on-call: if they don't have the knowledge or permissions, then their only option is to escalate (or ignore).

hrefna,
@hrefna@hachyderm.io avatar

@thisismissem

Agreed, with a note: I'd consider "requiring that they escalate" to be ideal.

It's a great way to expose the problems in your system and get people to fix them. If I need to wake the CTO up at 3 AM every time I need to access the database to fix a problem, and I do that reliably, then there's a good chance that is going to get prioritized at some point.

But to do that you need a culture that encourages and facilitates that kind of escalation.

@sgf @nikclayton

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton I know my limitations and my strengths, and on-call isn't one of my strengths; it's not for me, it doesn't play to what I'm good at, and this is actually fine.

There's some people who prefer that type of work, this is also fine.

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton think of it another way: you're probably not going to be able to tell me the intricacies of complex Node.js, Rails or React applications, and I'd not expect you to — likewise, I don't know how to do all the operational tasks that are typically classified as "devops", at least not in a way that's meaningful for on-call.

mekkaokereke,
@mekkaokereke@hachyderm.io avatar

@thisismissem @nikclayton

A lot of times, conversations about whether Product Engineers should or should not be on-call break down because on-call often has a hidden assumption of "pages going off in the middle of the night, disruption of life events, etc."

On call also has a hidden assumption of "We will put you in a situation outside of your technical depth, without support or training."

It's less scary if you can say:

  • Your on-call is only during work hours
  • We will teach/support you
borland,
@borland@mastodon.nz avatar

@mekkaokereke @thisismissem @nikclayton I agree, but if you’re only “on-call” during work hours, isn’t that just doing your job?
Eg: My team has a “requests” slack channel where other parts of the org can ask us questions or raise problems in our area of ownership. Everyone takes a turn at monitoring the channel during work hours and triaging requests. I wouldn’t say that is on-call though?

mekkaokereke,
@mekkaokereke@hachyderm.io avatar

@borland @thisismissem @nikclayton

How I set things up sometimes:

  1. Feature Work
  2. On Call
  3. On Duty

You can work in any of the 3 modes in a given week. Usually you're On-Duty the week after On-Call.

Your team on an app store might own the "top charts" feature, where users see top 100 Apps by category. Example feature additions:

  1. add regional top charts
  2. add protection from spamming and spam detection
  3. add form factor specific top charts
  4. Drop staleness from daily to 10 mins
mekkaokereke,
@mekkaokereke@hachyderm.io avatar

@borland @thisismissem @nikclayton

If you're in a feature work week, you have no on call or on duty responsibilities. Just design and build the features in the best way possible.

If you're in an on-call week, you technically have no feature work responsibilities. If the service goes down, users see a blank screen instead of Top Charts. Maybe the service doesn't go down completely, but latency spikes, or resource utilization skyrockets. Restore service with minimum disruption to users or SWEs.

mekkaokereke,
@mekkaokereke@hachyderm.io avatar

@borland @thisismissem @nikclayton

You might decide that this service going down shouldn't cause the users to see a blank screen. You might have an idea for the Top Charts client library to keep slightly stale, but known good versions of Top Charts in it. So if the service call fails, it falls back to a stale version of the Top Chart. You need to check with others to make sure that this is acceptable.

  1. freshest data (100ms)
  2. last successfully fetched data (~2 mins)
  3. stale (~24hrs)
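
A rough sketch of that three-tier fallback (not mekka's actual design; the names and data are illustrative):

```python
# Try the live service first, fall back to the last successful fetch, then to a
# stale-but-known-good copy shipped with the client.
import time
from typing import Callable, Optional

STALE_FALLBACK = {"top_charts": ["app-a", "app-b"], "fetched_at": 0.0}

class TopChartsClient:
    def __init__(self, fetch_fn: Callable[[], dict]):
        self._fetch = fetch_fn                 # e.g. an HTTP call to the Top Charts service
        self._last_good: Optional[dict] = None

    def get_top_charts(self) -> dict:
        try:
            fresh = self._fetch()              # tier 1: freshest data
            self._last_good = {**fresh, "fetched_at": time.time()}
            return self._last_good
        except Exception:
            if self._last_good is not None:
                return self._last_good         # tier 2: last successfully fetched (~minutes old)
            return STALE_FALLBACK              # tier 3: stale but known good (~hours old)
```
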
mekkaokereke,
@mekkaokereke@hachyderm.io avatar

@borland @thisismissem @nikclayton

The feature would reduce the severity of a "Top Charts service down" event. It could also reduce the number of pages On-Calls receive. The SLO can be safely relaxed: outages of up to 1hr are now fine.👍🏿

You would build that feature in your On Duty week. During On Duty, you're also first up to answer or route inbound questions to your team, as well as building these types of "quality of life improvement" features. It's built-in, SWE-controlled capacity to pay down tech debt.

thisismissem,
@thisismissem@hachyderm.io avatar

@mekkaokereke @nikclayton this entirely! I've worked at many companies where I've been expected to be on-call, but they'd schedule me when I'm normally asleep (and I keep weird hours!), or they'd expect me to follow processes or work with systems I had zero training or experience in, all whilst not wishing to do retrospectives & proper engineering practices for maintainability, observability, and backwards compatibility.

nikclayton,
@nikclayton@mastodon.social avatar

@mekkaokereke @thisismissem yes, definitely.

Stuff like this can really depend on your experience. If you've never seen a healthy on-call rotation you can end up thinking all on-call is painful.

Bad management is another example. If you've only ever had bad (or at best mediocre) managers it can be really difficult to identify how the manager is dragging the team down, and just assume that management is something you have to work around to get things done.

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem Completely agree. That's why a mature org should be able to empower whoever is on-call with the ability and tools to mitigate the problem first.

Understanding why the issue occurred can come later. Stop the bleeding first. That means investments in tooling, telemetry / observability, staged rollouts, feature delivery through flag flips, and all that sort of good stuff.

On-call should not be scary. I'm getting a vibe that you've had, or seen, some poor on-call experiences.
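
"Feature delivery through flag flips" as a mitigation tool can be as simple as the sketch below; the flag file, flag name and functions are hypothetical, not any particular product's API:

```python
# Minimal sketch: gate the risky code path behind a flag that on-call can flip
# off without a deploy or deep product knowledge.
import json
import pathlib

FLAGS_FILE = pathlib.Path("/etc/myservice/flags.json")   # hypothetical flag store

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a flag that on-call can flip directly, or via whatever syncs this file."""
    try:
        return bool(json.loads(FLAGS_FILE.read_text()).get(name, default))
    except (OSError, ValueError, AttributeError):
        return default                         # fail safe if the flag store is unreadable

def rank_with_new_algorithm() -> list[str]:
    return ["app-b", "app-a"]                  # stand-in for the risky new path

def rank_with_old_algorithm() -> list[str]:
    return ["app-a", "app-b"]                  # stand-in for the known-good path

def render_top_charts() -> list[str]:
    if flag_enabled("new_ranking_algorithm"):
        return rank_with_new_algorithm()       # can be turned off mid-incident without a rollback
    return rank_with_old_algorithm()
```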

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem But that's not how it should be. Instead of a world where you're not comfortable being on-call as a product engineer, the org should be helping create a world where on-call is no more painful than other engineering responsibilities.

But way too many orgs jump to creating an on-call function before they're properly staffed to support it, and organisationally mature enough to support the people doing the work.

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton you'd think, but I've worked in a LOT of dysfunctional organisations, so it's MUCH safer to just say "I don't do on-call"

There's the ideal world we want, and then there's the understaffed, undertrained, undercompensated hell hole of a world we typically get.

nikclayton,
@nikclayton@mastodon.social avatar

@thisismissem yeah, that sucks.

If it's any consolation at least it's given you a handy set of questions to ask future employers at interviews.

thisismissem,
@thisismissem@hachyderm.io avatar

@nikclayton I've also had the experience that employers & hiring managers will say just about anything to hire me, but do extremely little to retain me.

NotYourSysadmin,

@thisismissem @nikclayton No one is expecting you to be able to repair K8S or other components of infrastructure. But consider the flip side of the coin, where I have an app for which I haven't written a line of code that I'm responsible for keeping up.

The point of a shared on-call rotation across areas of expertise is that it makes the overall system more reliable. A developer or team of developers whose primary metric is number of releases or lines of code or features released isn't optimizing for overall system reliability. Instead, it falls to the Ops team; on-call is someone else's problem, and if it's something wrong with my app, it can wait until Monday.

Obviously, individuals should be fairly compensated for work outside of normal hours. But overall system reliability is the responsibility of far more than the Infrastructure team, no matter whether they're called DevOps, SREs, Platform Engineers, or anything else.
