
As I get older, I find myself enjoying these types of stories less and less. My issue comes from the fact that nobody seems comfortable having a conversation about facts and data, instead resorting to childish analogies about turning knobs.

That’s not how our jobs work. We don’t “adjust a carefulness meter.” We make conscious choices day to day based on the work we’re doing and past experience. As an EM, I’d be very disappointed if the incident post mortem was reduced to “your team needs to be more careful.”

What I want from a post mortem is to know how we could prevent, detect or mitigate similar incidents in future and to make those changes to code or process. We then need to lean on data and experience to assess what the trade-offs of those changes would be. Adding a test? Go for it. Adding extra layers of approval before shipping? I’ll need to see some very strong reasons for that.



> What I want from a post mortem is to know how we could prevent, detect or mitigate similar incidents in future and to make those changes to code or process.

The answer this post gives to that bizarre question, which always gets asked, is ‘nothing’, unless you want to significantly adjust the speed at which we deliver features.

Any added process or check is going to impose overhead and make the team a little bit less happy. Occasionally you’ll have a unicorn situation where there is actually a relatively simple fix, but those are few and far between.

In extremis, you’re reduced to a situation in which you have zero incidents, but you also have zero work getting done.


That’s simply not true. Some processes are good, some post-mortem outcomes will focus on improving deployment speed (so you can revert changes faster), or improve your monitors so you can detect (and mitigate) incidents faster.

On the other hand, enforcing a manual external QA check on every release WILL slow things down.

You’re repeating the same mistake as the article by assuming “process” sits on a single sliding scale that naturally slows work down. This is because you’re not being precise in your reasoning. Look at the specifics and make a decision based on the facts in front of you.


> This is because you’re not being precise in your reasoning. Look at the specifics and make a decision based on the facts in front of you

I agree with the premise here, but in my experience running incident reviews, the issue I see is a mixture of performative safetyism and reactivity.

Processes are a cheap band-aid over design, architectural, and cultural issues.

Most of the net-positive micro-reforms we had after incident reviews were the ones that invested in safety nets, faster recoveries, and guardrails, rather than a new process that taxes everyone.


> Processes are a cheap band-aid over design, architectural, and cultural issues.

They can be, yes. I have a friend who thinks I'm totally insane for wanting to release code to production multiple times a day. His sweet spot is once every 2 weeks, because he wants QA to check over every change. Most of his employers can manage once a month at best, and once a quarter is more typical.

> Most of the net-positive micro-reforms we had after incident reviews were the ones that invested in safety nets, faster recoveries, and guardrails, rather than a new process that taxes everyone.

I 100% agree with this. Your comment also reminds me that incident reviews are necessary but not sufficient. You also need engineering leadership reviewing at a higher level, making bigger organisational or technical changes to further improve things.


> Occasionally you’ll have a unicorn situation where there is actually a relatively simple fix, but those are few and far between.

Perhaps we have different backgrounds, but even in late stage startups I find there is an abundance of low hanging fruit and simple fixes. I'm sure it's different at Google, though.


Thanks for saying this. This mindset has infiltrated every engineering team and made software development hell for those of us who actually like shipping. Being more careful, adding more checks and processes, has exponentially diminishing returns (just invert the graph). Somehow, though, teams have been led to believe that every incident needs to be met with more process.


This is an even bigger problem outside of our profession. Organisations do everything in their power to reduce the agency of employees and the general public through process. A company would rather spend 20 hours verifying a purchase than allow an unnecessary purchase every now and then. German culture in particular seems to take this to an extreme.


I've worked in highly autonomous and empowered teams that still root-cause-analyzed every incident to death. The rationale was that if you get PagerDutied in the middle of the night, it had better be worth losing your sleep over. And it was great. I've also worked in slow, bureaucratic environments. They're not the same. Turning the magic dial up towards "more care" apparently doesn't move you along the same axis towards bureaucracy per se.


I like the German workplace (for white-collar jobs) for several reasons, but this is one thing that drives me away.

We used to have an issue with deployments breaking in production, and one of the reasons was that we did not have any kind of smoke test after deployment (in our case, a rolling update was our only deployment strategy).

The rational solution was simply to create that post-deployment step. The solution our German managers demanded instead: cut off deployment access for the entire team, put someone “on-call” to check the deployments, and keep a deployment spreadsheet to track it all.
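For context, the post-deployment smoke test being described can be tiny. A minimal sketch, not the commenter's actual setup: it assumes a health endpoint and takes the fetch call as an injected parameter (both hypothetical) so it can be wired into any pipeline:

```python
import time

def smoke_test(fetch, retries=5, delay=2.0):
    """Poll a health check after a rolling update; True once it reports healthy.

    `fetch` is any callable returning an HTTP status code, injected so the
    check is easy to wire into different pipelines (and to test in isolation).
    """
    for _ in range(retries):
        try:
            if fetch() == 200:  # healthy: the new version is serving traffic
                return True
        except OSError:
            pass  # instances may still be cycling mid-rollout; retry
        time.sleep(delay)
    return False  # never went healthy: fail the deploy job and roll back
```

In a deploy job you might call it with a real fetcher, e.g. `smoke_test(lambda: urllib.request.urlopen("https://myapp.example/healthz").status)` (URL hypothetical), and fail the pipeline when it returns False.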


Good lord, are all those non-tech managers the reason Europe can't seem to build viable tech companies?


In my experience it’s a cultural problem in Germany. Everything has to be done methodically, even if the method adds a disproportionate amount of friction. Often, the purported benefits are not even there. The thoroughness is full of holes, the diligence never done, the follow-ups never happening.

It leads to situations where you need a certificate from your landlord, which you take to the certified locksmith your landlord contracted, where you show a piece of ID to order a duplicate key that arrives 3 business days later at a cost of €60. A smart German knows that there’s a locksmith in the basement of a nearby shopping mall who will gladly duplicate any key without a fuss, but even then the price is inflated by the authorised locksmiths.

I document German bureaucracy for immigrants. Everything is like this. Every time I think “it can’t really be this ridiculous, I’m being uncharitable”, a colleague has a story that confirms that the truth is even more absurd.

It’s funny until you realise the cost it has for society at large. All the wasted labour, all the bottlenecks, and little to show for it.


The other end of the spectrum has snake-oil salesmen grifting from town to town. It's a hard equilibrium to balance.


> Thanks for saying this. This mindset has infiltrated every engineering team and made software development hell for those of us who actually like shipping. Being more careful, adding more checks and processes, has exponentially diminishing returns (just invert the graph).

I'm going to strongly disagree with this (when it's done well, not bureaucratically).

We review what can be improved after problems and incorporate it into our basic understanding of everything we do. It's about gaining the experience and muscle memory to execute fast while also accounting for things proactively.

It's a long-term process, but the payoff is great: reduced time and effort spent on problems after the fact ends up, over the long term, increasing the amount of valuable work produced.

The key is to balance this process pragmatically.


I call this the "grandpa's keys" problem!

When Grandpa was 20 years old he left the house and forgot to take his keys, so every time he left the house he checked his pockets for his keys.

When he was 24 he left the house and left the stove on. He learned to check the stove before leaving the house.

When he was 28 he left his wallet at home. He learned to check for his wallet.

...

Now Grandpa is 80. His leaving-home routine includes: checking for his keys, his phone and his wallet; making sure the lights are off, the stove is off, the microwave door is closed, the iron is off, and the windows are closed in case it rains...

Grandpa has learned from his mistakes so well that it now takes him roughly an hour to leave the house. Also, he finds he doesn't tend to look forward to going out as much as he once did...


In response to this, I'd like to highlight what I wrote:

> We then need to lean on data and experience to assess what the trade-offs of those changes would be.

As engineering leaders, this is a key part of our job. We don't just blindly add processes to prevent every issue. I should add that we also need to analyse our existing processes to see what should change or is not needed any more.


And he will be replaced by a younger version who hasn't learned so many lessons.


The road to bureaucrat hell is paved with good intentions

https://andrewchen.substack.com/p/bureaucrat-mode


What you're saying agrees with the article.

However, you say you're agreeing with scott_w, and scott_w is criticizing the article. So this is confusing.


I get where you are coming from, and I certainly do expect actual facts, data, and reasoning to be a part of any serious postmortem analysis. But those will almost always be in relation to a very specific circumstance. I think there is still room for generalized parables such as this article - otherwise, we would be reading a postmortem blog post, which are also common here and usually do contain what you are asking for.


I think you can generalise without resorting to silly games like the article does. I gave some examples in a sibling comment that are high level enough to give an idea of the types of things I’d think about, without locking in to a specific incident I was part of.


Isn't that the point of the story, though? To say "we need to have a conversation about facts" rather than always responding to a problem with something like "we need to be more careful next time"?

As far as I can tell, the author doesn't really give any generalized advice on how careful you should be, he's just pointing at the "carefulness dial" and telling people to make an informed decision.


I'm not really sure that is the author's point. I re-read the article to try to find your interpretation and couldn't really find it there. Maybe a slight hint in the coda?


Agreed. I recall being taught in college physics labs that there is no such thing as "human error". Instead, think about the causes and mechanisms of each source of error, which helps you both quantify and mitigate them.

Same energy here. “Be more careful” is extraordinarily hand-wavy for a profession that calls itself engineering.


Exactly, these stories only appeal to children and childish adults.


I think we can criticise the analogy as reductive without insulting people, here.


Or maybe put more accurately: these stories don’t appeal to you and people like you.

What is your objection? Oversimplification? Lack of utility?


you're probably on the spectrum, as most anyone here.

there's a tradeoff between shipping garbage fast that won't explode in your hands and getting promoted. and there's also the political art of selling what you want to other people by masking it as what they want.

you and i and most people here will never understand any of that. good luck. people who do understand will have the careless knob stuck at 11.

these analogies help us rational people point out the BS at least, without having to fight the BSer.


> you're probably on the spectrum, as most anyone here.

Your comment starts with an ableist slur so I’m sure it’s going to be good /s

> you and i and most people here will never understand any of that. good luck. people who do understand will have the careless knob stuck at 11.

Nah, reading this comment wasn’t worthwhile after all.

> these analogies help us rational people point out the BS at least, without having to fight the BSer.

How cute, you think you’re “rational.”



