I'm not inspired by the 'cool tactical' framing this article attempts to convey.
I've worked on-call for a fundamental backbone service of the internet in the past and been paged into middle-of-the-night outages. It's harrowing and exhausting. Cool names like 'incident commander' do not change this.
We also had a "see ya in the morning" culture. Instead I'd be much more impressed to have a "see ya in the afternoon, get some sleep" culture.
> Instead I'd be much more impressed to have a "see ya in the afternoon, get some sleep" culture
I've led teams for over a decade that have oncall duties. One principle we have lived by is that if you are paged outside of hours, you take off the time you need to not hold a grudge against the team/company. Some people don't need it, some people take a day off, some people sleep in, some people cash in on their next vacation. To each their own according to their needs. Seems to work well.
We also swap out oncall in real time if, say, someone gets paged a couple nights in a row.
Yup, this is important or people will burn out real quick. And when there are major incidents, as IC it's especially important to dismiss people from bridges as early as possible when I know I'm going to need them sooner rather than later the following day. Or swap with a more junior person so that the senior one is nice and fresh for when the next wave is anticipated.
I don't disagree with your post, but one thing I want to mention is the origin of the term "Incident Commander"--it doesn't exist to be cool, but rather derives from how FEMA handles disasters. I suspect its usage in IT became a thing because it was already used in real-life, and it made more sense than creating a new term.
If you have two hours, you can take the training that describes the nomenclature behind the Incident Command System, and why it became a thing:
This online training takes about 2 hours and is open to the general public. I took it on a Saturday afternoon some years ago and it gave me useful context to why certain things are standardized.
The concepts and terms of incident command are not from the military (or ER as another poster suggested). They're from the fire service and emergency management in general. I don't know if that changes people's perceptions and I agree that no amount of terminology changes how exhausting being on call is. But if people are reacting negatively to "military" connotations, I think that is unwarranted.
I think actually learning what the ICS is _for_ might help people understand a bit better why it's not necessarily just "unnecessary tacticool". It's not just a bunch of important-sounding names for things.
ICS, at its core, is a system for helping people self-organize into an effective organization in the face of quickly changing circumstances and emergent problems.
Some simple rules are things like:
* The most senior/qualified person on-site is generally in charge. (How you determine that kinda varies depending on organization.)
* Positions are only created when required. You don't assign people roles unless there's a need for that role.
* Positions are split and responsibilities delegated as the span of control increases beyond a set point.
* Control should stay as local to the problem as it realistically can while still solving the problem.
From there, it goes on to standardize a template hierarchy and defines things like specific colours associated with specific roles so that as roles change and chaos ensues, people can continue to operate effectively and in an organized manner. In-person, this means things like the commander/executive roles running around in red vests with their role on the back. If the role changes hands, so does the vest.
Some of the roles in that template organization are things like:
* The "Public Information Officer" who is responsible for preparing and communicating to the public. This makes a single person responsible to ensure conflicting or confusing messaging is not making its way out.
* A "Liaison Officer" who is responsible for coordinating with other organizations. This provides another central point of coordination for requests flowing outside of your response.
I think we could all imagine how this starts to become valuable in, say, a building collapse scenario with police, fire, EMS, the gas company, search and rescue, emergency social services, etc all on scene.
In an IT context, what this means is that, generally, the most senior person online is going to be in charge of receiving reports from people and directing them. If there aren't many people around, they'd generally be pitching in to help as well.
As more people show up and the communication and coordination overhead increases, they step out of doing any specific technical work. If enough show up, they may then delegate people out as leaders of specific teams tasked with specific goals (they may also just tell them they're not needed and send them to wait on standby).
All roles, including the "Public Information" and "Liaison" roles, fall to the Incident Commander unless delegated out. At some point, if the requests for reporting from management start interfering with their role as Incident Commander, they delegate that role out. If it turns out the incident is going to require heavy communication or coordination with a vendor, they may delegate out the Liaison role to someone else.
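A minimal sketch of that "everything falls to the Incident Commander unless delegated" rule, in Python. The role names come from the description above; the `Incident` class, its methods, and the people's names are made up for illustration and are not part of ICS or any tool.

```python
from dataclasses import dataclass, field
from typing import Dict

# Role names follow the description above; everything else here is hypothetical.
ROLES = ("Incident Commander", "Public Information Officer", "Liaison Officer")


@dataclass
class Incident:
    commander: str
    # Only roles that were explicitly handed off appear in this mapping.
    delegated: Dict[str, str] = field(default_factory=dict)

    def holder(self, role: str) -> str:
        """Return who currently holds a role; undelegated roles stay with the IC."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        if role == "Incident Commander":
            return self.commander
        return self.delegated.get(role, self.commander)

    def delegate(self, role: str, person: str) -> None:
        """Hand a non-command role to someone else, only once it is needed."""
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        if role == "Incident Commander":
            raise ValueError("use transfer_command to change the IC")
        self.delegated[role] = person

    def transfer_command(self, person: str) -> None:
        """Change the IC; roles that were never delegated move with the command."""
        self.commander = person


incident = Incident(commander="alice")
incident.delegate("Liaison Officer", "bob")  # vendor coordination got heavy
print(incident.holder("Liaison Officer"))             # bob
print(incident.holder("Public Information Officer"))  # alice, never delegated
```

The point of modelling it this way is that there is never an unowned role: a lookup either finds an explicit delegate or falls back to whoever is in command right now.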
ICS is probably largely unnecessary if your response never grows beyond the number of people who can effectively communicate in a Google Meet call. But as more and more people get involved, it offers a lot of valuable lessons, learned through real-world experience in situations much more stressful and dangerous than we ever face, that help you manage and coordinate the human resources responding to an incident.
(Disclaimer: That's all basically from memory. The city sent me on an ICS course, an ICS in an emergency operations centre context course, and a few more courses a few years back as part of volunteering with an emergency communications group. It's probably 90% accurate.)
Yeah, I have one of those introductory Incident Command System certificates floating around somewhere, too.
The Incident Command System contains guidelines to help different organizations work together in an emergency. You might need the fire service, or police, or hazmat, or EMS, possibly across multiple jurisdictions. If I recall correctly, ICS bans all those "10-4" radio codes, standardizes job roles, etc. And it contains rules for scaling the temporary organization up and transferring leadership as necessary.
Overall, it seems well-suited to real-world emergency response. And the training materials recommended using it in non-emergency situations, too, such as large parades. The idea was that major emergencies are rare, but it's worth getting practice with more common events.
I'm not quite sure how well the ideas behind ICS apply to IT outages. Most outages occur within a single organization, so there's less need to coordinate with outsiders. But some of the advice for scaling the response team and handling leadership transfers might be very useful.
It's not just different organizations working together, I'd call it more... working with what you have. In the event of an earthquake the disaster staging area isn't going to have a nice assortment of police, fire, EMS, and other employees--they're going to get whoever was in the area at the time and need to try and organize and allocate them to do the most good with what they have. Not everyone is going to be going back out to do the job they started the day with.
In that sense I could still see some value around the ideas of organizing based on whoever actually shows up when the 3AM call goes out, adjusting roles on the fly as the situation evolves, and assigning out specific people responsible for updating management, interfacing with vendors, etc. And I think it puts people in the right mindset to do all that.
But yeah, I wouldn't like... plan on sending all my staff out for ICS training next week or anything. I think there'd be some value in sending a couple people responsible for creating and implementing your disaster planning to take it though--there are a lot of good things you could copy, some you could modify, and some you could throw away to make a pretty damn good IT-oriented response procedure.
And yeah, as far as I've seen our city actually uses ICS any time the emergency operations centre is activated. That's whenever there's a large-scale emergency that requires more centralized coordination, but also sometimes just if there's a large event that requires coordination (e.g., a festival with 100k people showing up at the beach). It's not just for practice, but because it's just a system that everyone's already familiar with that can respond to the situation if it evolves. Nobody needs to make the call that "we're switching into emergency mode now" and stop everything to reorganize, they simply respond.
> We also had a "see ya in the morning" culture. Instead I'd be much more impressed to have a "see ya in the afternoon, get some sleep" culture.
German labor laws forbid employees from working for the next 10-13 hours after a long on-call situation that follows a normal work day, just like that. Add in time compensation, and a bad on-call situation at night easily ends up as the next day off, paid.
I've found this to take a lot of the edge off on-call. Sure, it /sucks/ to get called at 1am and do stuff until 3, but that's a day to sleep in and recover. Maybe hop on a call if the team needs input, but that's optional.
These incident role names are fairly common in product companies these days. I guess you are correct that they do suggest a certain culture around incidents, but in my experience it's definitely a good thing. It's a "don't blame people, let's focus on the root cause, get things back up, and figure out how to prevent this next time" sort of thing. People try to meet SLAs and they treat each other like humans. We focus on improving process/frameworks over blaming individual people. And yup, I think this comes along with, "incident yesterday was intense, I'm gonna catch up on sleep".
I agree with your comment. These names are just ways for teams to delegate different responsibilities w.r.t incident management quickly and in a way that's understood by everyone. Having concrete names for such roles is both a good thing (everyone knows who can make the call for hard decisions) and helps you talk intelligently about the evolution of such roles. e.g. "our Incident Commanders used to spend 15% of their time in p0 incidents, but that has reduced to 10% due to improvements in rollout procedures/runbooks/etc."
It seems to be a bit of cargo cult, to be honest. They seem to take inspiration from ER teams or the military.
I think that this kind of drill helps a lot for cases where you can take a pre-planned route, like deploying that backup server or rerouting traffic. But the obvious question then is: Why not automate that as well?
When it comes to diagnosis or, worse, triage, in my experience you want independent free agents looking at the system all at once. You don't want a warroom-like atmosphere with a single screen but rather n+1 hackers focusing on what their first intuition tells them is the root cause. In a second step you want these hackers to convene and discuss their root cause hypotheses. If necessary, you want them to run experiments to confirm these hypotheses. And then you decide the appropriate reaction.
> They seem to take inspiration from ER teams or the military
It’s probably nothing but overestimation but I feel like I’m seeing more of this later in my career than I did early on, or maybe I’m paying more attention?
Whatever it is: past experience (which includes coming from a military family in the states) has taught me to avoid companies that crib unnecessary amounts of jargon, lingo and colloquialisms from the military.
Curious if others have noticed this or feel the same, and what experiences led you to feel that way?
I don't know how old you are but my career now exceeds two decades. I definitely see this more now but that's because I institute it. Earlier in my career, we failed at incident management and at ownership. We now share the burden of on-call not just with the operators (sysadmins or old) but also with the people who wrote the code. We've spent a lot of time building better models based on proven methods, quite a few come from work done in high intensity roles paid by tax dollars: risk analysis, disaster recovery, firefighting, command and control, incident management, war games, red teams.
You've got a couple of years on me, I've been in the game a little over 13 years now.
I support the notion there's a strong difference between lingo that's properly applied to the situation and lingo that is recklessly applied because it "sounds cool".
The examples you gave seem to be fair game for the work being done, in the interest of brief, specific language; the examples I gave in another comment ("flanking", "breaching"), however, are just grating and...weird to use in a work environment.
Agree completely. It's a strong signal that someone has a military cosplay fetish (which very few people with experience in the actual military do), which in turn tends to come along with other dysfunctional traits. It's a warning for me that the person is not likely to be a good vendor, customer, or collaborator.
My favorite one was when a superior was explaining a plan to right-size some new machines as we slowly migrated customers onto the appliance, along with some particularly aggravating memory consumption issues; even after inspection and a lot of time spent, it made no real sense to us why they were occurring the way they were.
"dvtrn you are to take the flank and breach this issue with Paul"
And this ran all the way up to the top of the org. Senior leaders were constantly quoting that Jocko Willink fella. It was...something.
My old man (a former Drill Instructor, made for an interesting childhood) found it utterly hilarious when I'd call him up randomly with the latest phrase of the day, uttered by some director or another. To my knowledge, and I sure-damn asked, the only affinity anyone on the executive team had with the military was two of them having buddies who served.
I agree. I think this particular framing gets things slightly wrong. You want parallelism, but you still need central organization (so that you can have clear delegation) and delegation of work to various researchers. For a complex incident, I've seen 5+ subteams researching various threads of the incident. But, importantly, before any of those subteams take any action, they report to the IC so that two groups don't accidentally take actions that might be good in isolation but are harmful when combined.
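As a rough sketch of that "report to the IC before acting" step, in Python: each subteam proposes an action and names the components it would touch, and the IC's log rejects anything that overlaps with an action already in flight. The `Proposal` and `ActionLog` names, and the idea of tracking conflicts per component, are assumptions made up for this illustration and not something from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Proposal:
    subteam: str
    description: str
    touches: FrozenSet[str]  # components the action would change


@dataclass
class ActionLog:
    # component -> subteam that currently has an approved action on it
    in_flight: Dict[str, str] = field(default_factory=dict)

    def conflicts(self, proposal: Proposal) -> Set[str]:
        """Return the subteams whose approved actions overlap with this proposal."""
        return {self.in_flight[c] for c in proposal.touches if c in self.in_flight}

    def approve(self, proposal: Proposal) -> bool:
        """Record the action only if nobody else is already changing those components."""
        if self.conflicts(proposal):
            return False
        for component in proposal.touches:
            self.in_flight[component] = proposal.subteam
        return True


log = ActionLog()
restart = Proposal("db-team", "restart the replica", frozenset({"database"}))
reroute = Proposal("net-team", "reroute traffic", frozenset({"load-balancer", "database"}))
print(log.approve(restart))    # True
print(log.approve(reroute))    # False: db-team is already changing the database
print(log.conflicts(reroute))  # {'db-team'}
```

In practice the IC is a person making a judgment call rather than a lookup table; the sketch only shows why a single place that sees every proposed action can catch combinations that each subteam would miss on its own.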
My experience is there’s little conflict between a central conference call or room, and multiple independent investigators, since those investigators need to present and compare their findings somewhere. It would indeed be a mistake to demand everyone look at one high-level view, though. Based on the organization depicted in the article, this would be the “researcher” role, split among multiple people.
Firefighter/EMT here. The fire service has trained on the Incident Command System for decades because it works, not because it's "cool." I didn't know ICS was used in emergency reliability response but it makes perfect sense. Heck, ICS works great for organizing a family camping trip.
Like Agile, ICS is a set of principles that work well if you are properly trained in the system. Unlike Agile, ICS has not yet been buzzword-harvested by clueless managers to get promoted, so the concept itself still has some value.
Yeah, the tone of this article is really odd, and like, the bulk of the content is just a narrativization of the incident roles in the Google SRE book. The only 'trick' is running game days?
The worst part is hearing from your manager the next day that the NOC operator complained about your rude tone of voice when waking up and answering the phone at 3am.
> which captures all the key log files and status information from the ailing machine.
Machine? As in singular machine goes down and you wake up 5 people? That just sounds like bad planning.
> Pearson is spinning up a new cloud server, and Rawlings checks the documentation and procedures for migrating websites, getting everything ready to run so that not even a second is wasted.
Heroic. But in reality you have already wasted minutes. Why is this not all automated?
I understand that this is a simulated scenario. Maybe the situation was simplified for clarity, but really if a single machine going down leads to this amount of heroics then you should work on those fundamentals. In my opinion.
They skipped over a few steps of ICS. ICS starts with a single person playing all roles.
It prescribes a way to scale up and down the team in ways that streamlines the communication so everyone knows their role, nothing gets lost when people come in and out of the system and you don't have all hands conference calls and multiple people telling the customers multiple things or multiple people asking for status updates from each person.
Not only that but they appear to be okay with the fact that a single ISP has knocked them offline. If I was a customer of theirs and found out, I would probably change providers.
While reading this, I was thinking "This is so important that you'll wake all these people up in the middle of the night, but you only have a single ISP? No backup ISP with automated failover?"
Nice to see articles like this describing a company's incident response process and the positive approach to incident culture via gamedays (disclaimer: I'm a cofounder at Kintaba[1], an incident management startup).
Regarding gamedays specifically: I've found that many company leaders don't embrace them because culturally they're not really aligned to the idea that incidents and outages aren't 100% preventable.
It's a mistake to think of the incident management muscle as one you'd like exercised as little as possible when in reality it's something that should be in top form because doing so comes with all kinds of downstream values for the company (a positive culture towards resiliency, openness, team building, honesty about technical risk, etc).
Sadly this can be a difficult mindset to break out of especially if you come from a company mired in "don't tell the exec unless it's so bad they'll find out themselves anyway."
Relatedly, the desire to drop the incident count to zero discourages recordkeeping of "near-miss" incidents, which generally deserve to have the same learning process (postmortem, followup action items, etc) associated with them as the outcomes of major incidents and game days.
Hopefully this outdated attitude continues to die off.
If you're just getting started with incident response or are interested in the space, I highly recommend:
- For basic practices: Google's SRE chapters on incident management [2]
- For the history of why we prepare for incidents and how we learn from them effectively: Sidney Dekker's Field Guide to Understanding Human Error [3]
> Relatedly, the desire to drop the incident count to zero discourages recordkeeping of "near-miss" incidents, which generally deserve to have the same learning process (postmortem, followup action items, etc) associated with them as the outcomes of major incidents and game days.
Zero recorded incidents is a vanity metric in many orgs, and yes, this loses many fantastic learning opportunities. The end result is that these learning opportunities eventually do happen, but with significant impact associated with them.
> Regarding gamedays specifically: I've found that many company leaders don't embrace them because culturally they're not really aligned to the idea that incidents and outages aren't 100% preventable.
So. Much. This. Unless leaders were engineers in the past or have kept abreast of evolution in technology, the default mindset is still "incidents should never happen" rather than "incidents will happen, how can we handle them better". This is especially pronounced in politics-heavy environments since outages are seen as a professional failure, a way to score brownie points over the team that fails. As a result, you often have a culture that tries to avoid being responsible for outages at any cost, which (ironically) leads to worse overall quality of the system since the root cause is never dealt with.
My team (5 devs, 10 people total on the product) currently doesn't use any incident response-specific tooling. We have a Confluence SOP for incident response, a page template for RCAs, an #incident-response slack channel, and Zoom but no specific tooling. Just yesterday someone recommended Kintaba/incident.io/OpsGenie/etc, but I don't know if that's overkill for our team.
At what point do you think a tool like yours is necessary or worthwhile, as opposed to using generic tools?
Obviously biased but I definitely think you can get good value out of a tool like Kintaba at your scale (if you're only using it within engineering it would actually have no cost since we're free for 5 or fewer users!).
Kintaba is built to be simple out of the box and allow more depth and complexity as you grow, so initially you might use it the same way you manually use slack today (announce incidents, create a specific incident channel) where your primary initial value is that it makes those motions easier and helps you be more consistent with how you approach incidents and improve recordkeeping, but as you grow you can start to add oncall rotations for your incident roles, automated actions for different incident types, and other things like tagging for better reporting.
Feel free to reach out to us at hello@kintaba.com with questions, or even if you'd just like to chat about how to get up and running!
I've done a LOT of incident management and I'm not happy about it. The biggest issue I have run into other than burnout is this:
Thinking and reasoning under pressure are the enemy. Make as many decisions in advance as possible. Make flowcharts and decision trees with "decision criteria" already written down.
If you have to figure something out or make a "decision" then things are really really bad.
That happens sometimes, but when teams don't prep at all for incident management (pre-determined plans for common classes of problem), every incident is "really really bad".
If I have a low-risk, low-cost action with low confidence of high reward, I'm going to do it and just tell people it happened. Asking means I just lost a half-hour+ worth of money, and if I just did it and I was wrong we would have lost 2 minutes of money. When management asks me why I did that, I point at the doc I wrote that my coworkers reviewed and mostly forgot about.
A really common example is "it looks like most of the errors are in datacenter X", so you fail out of the datacenter. Maybe it was sampling bias or some other issue and it doesn't help, maybe the problem follows the traffic, maybe it just suddenly makes things better. No matter what, we get signal. Establish well in advance of a situation what the common "solutions" to problems are, and if you are oncall and responding, then just DO them and document+communicate as you do.
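To make "decision criteria already written down" concrete, here is a minimal Python sketch of the datacenter example. The 2-minute cost is the figure from above; the 60% threshold, the `PreapprovedAction` type, and the function name are illustrative assumptions and would be whatever your reviewed doc says.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PreapprovedAction:
    name: str
    # Criterion agreed on in advance: share of all errors seen in one datacenter.
    min_error_share: float
    # Rough cost of having taken the action when it turns out not to help.
    cost_if_wrong_minutes: float


# Hypothetical runbook entry; the threshold is made up for this example.
DRAIN_DATACENTER = PreapprovedAction(
    name="fail out of datacenter",
    min_error_share=0.6,
    cost_if_wrong_minutes=2,
)


def datacenter_to_drain(
    errors_by_dc: Dict[str, int],
    action: PreapprovedAction = DRAIN_DATACENTER,
) -> Optional[str]:
    """Return the datacenter to fail out of, or None if the criterion isn't met."""
    total = sum(errors_by_dc.values())
    if total == 0:
        return None
    worst_dc, worst_count = max(errors_by_dc.items(), key=lambda item: item[1])
    if worst_count / total >= action.min_error_share:
        return worst_dc
    return None


print(datacenter_to_drain({"dc-x": 900, "dc-y": 60, "dc-z": 40}))  # dc-x
print(datacenter_to_drain({"dc-x": 350, "dc-y": 330, "dc-z": 320}))  # None
```

The on-call person at 3am only has to check the numbers against the criterion; the arguing about whether 60% is the right threshold happened in a doc review weeks earlier.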
It sort of baffles me how much engineer time is seemingly spent here designing and running these "gamedays" vs just improving and automating the underlying systems. Don't glorify getting paged, glorify systems that can automatically heal themselves.
I spend a good amount of time doing incident management and reliability work.
Red team/blue team gamedays seem like a waste of time. Either you are so early on your reliability journey that trivial things like "does my database failover" are interesting things to test (in which case just fix it), or you're a more experienced team and there's little low-hanging reliability fruit left. In the latter case, gamedays seem unlikely to closely mimic a real-world incident. Since the low-hanging fruit is gone, all your serious incidents tend to be complex failure interactions between various system components. To resolve them quickly, you simply want all the people with deep context on those systems quickly coming up with and testing out competing hypotheses on what might be wrong. Incident management only really matters in the sense that you want to allow the people with the most system context to focus on fixing the actual system. Serious incident management really only comes into play when the issue is large enough to threaten the company + require coordinated work from many orgs/teams.
My team and I spend most of our time thinking about how we can automate any repetitive tasks or failover. In the case something can't be automated, we think about how we can increase the observability of the system, so that future issues can be resolved faster.
If you think of incidents as component failures, and the solution as increasing automation related to getting the faulty component back online again, you're under the old view of system failure. This view works for simpler systems.
More complex systems experience failures due to interactions between fully functioning components. The teams that made them didn't, for one reason or another, foresee that mode of interaction.
These are errors designed deeply into the system, and you can't automate recovery. You need to fix the problem at the cause.
Proper analysis is required, and if a game is what it takes to do that then why not? Additionally, it helps people learn to do that analysis on the fly. That is a crucial skill because those incidents are normal in complex systems. They will happen.
There is a month and day, Feb 15, in the header, but no year. I can't figure out if that's ironic or apropos, since this story reads like a thriller from perhaps ten years ago, but the post date appears to have been 2020-02-15 - yikes.
I may never understand why some places are all about assigning titles and roles in this kind of thing. You need one, maybe two, plus a whole whack of technical skills from everyone else.
I find Comms Lead role to be super useful bc i dont want to be bogged down replying to customers in the middle of the incident + probably don’t even have all the context/access. Everything else except ICM seems like a waste of time to me especially Recorder
You're mixed up, they're drilling. You drill so that when an emergency happens and it's 4:30am and you're bleary eyed your hands already know what to do (and are doing it) before your eyes even open all the way.
This sounds like a great way to wipe a database accidentally (like GitLab). The worst thing you can do to help fix a problem is having people asleep at the wheel.
Noted, but the point of the drill is precisely to uncover these failure modes and attempt to fix them. e.g. you might have automated runbooks to fix the problem rather than access the DB directly. You might have frequent backups and processes to easily restore from backups in case of database wipes.
Personally I rarely drink anything, so not applicable to me.
But I had seen a good attitude from a colleague once: If you want me to put my normal life on hold for on-call shifts, that is fine with me. But then you need to pay me for that time as if I was sitting in the office: "So I'm going to be paid an additional 128hrs the week I'm on call, okay?"
In the specific case it was not about drinking but about weekend or evening trips with the family where the company expected the employee to sit at home instead and wait to potentially be paged.
People quickly came to the conclusion that some lowered expectations for the on-call person would be appropriate and working via a 4G connection from a notebook is totally acceptable.
I would say the experience is transferable to drinking. "Don't drink yourself into a stupor" is sage advice, but we're not worrying about a bottle of wine, especially if it's shared over dinner or such.
Why not? I am not working just to work, I am working to enjoy my life with the people around me. I am also from Europe so it might be different views, but just because I am on call doesn't mean I'm going to stop living my life. Work doesn't define me as a person. :)
If I'm on call (OC), I'm responsible for the uptime of the system even after hours. So if I'm planning on going hiking, I will inform the secondary OC, or delay plans to a weekend when I'm not OC. Generally I do tend to avoid getting heavily inebriated (although of course there are times when this is unavoidable).
I'm not judging, but just pointing out that I've certainly experienced a different OC culture in the US.
Eh the whole point of being on call is that you are effectively assigned to ensure that you're ready to jump on incidents quickly at off hours if needed.
This depends on the importance you assign to on-call.
And that value is usually defined by the likelihood of incidents and the impact of these.
If you constantly have incidents that are critical, you can spend the engineering hours to fix the problems once and for all. If that is not possible because it's a different problem every time, it might be important to invest in more engineering resources and have them work shifts.
Printing for example has such a system where the impact of a stopped printing press can be catastrophic because no newspapers tomorrow. Thus there are on-site engineers that are paid to sit around and wait for a press to stop working.
If the occurrence of an incident is rare or the impact of them is basically nil for whatever reason, feel free to consider on-call not that important. Maybe an SLA of 6 hours is acceptable in such a situation.
If incidents are happening often and are important yet you do not want to spring for extra engineers but have your existing staff work on these on top of their regular duties, you need to come up with the right incentives. Massive pay helps to sweeten the deal and also provides incentives to prevent pages.
I agree with that, but all I am saying is being on call shouldn't stop you from having a drink or two and hanging out with friends. Life is boring otherwise being worried about this. :)
It doesn't match my experience, with a real incident.
I was a dev in a small web company (10 staff), moonlighting as sysadmin. Our webserver had 40 sites on it. It was hit by a not-very-clever zero-day exploit, and most of the websites were now running the attacker's scripts.
It fell to me to sort it out - the rest of the crew were to keep on coding websites. The ISP had cut off the server's outbound email, because it was spewing spam. So I spent about an hour trying to find the malicious scripts, before I realised that I could never be certain that I'd found them all.
You get an impulse to panic when you realise that the company's future (and your job) depends on you not screwing up; and you're facing a problem you've never faced before.
So I commissioned a new machine, and configured it. I started moving sites across from the old machine to the new one. After about three sites, I decided to script the moving work. Cool.
But the sites weren't all the same - some were Drupal (different versions), some were Wordpress, some were custom PHP. It worked for about 30 of the sites, with a lot of per-site manual tinkering.
Note that for the most part, the sites weren't under revision control - there were backups in zip files, from various dates, for some of the sites. And I'd never worked on most of those sites, each of which had its own quirks. So I spent the next week making every site deploy correctly from the RCS.
I then spent about a week getting this automated, so that in a future incident we could get running again quickly. Happily we had a generously-configured Xen server, and I could test the process on VMs.
My colleagues weren't allowed to help out, they were supposed to go on making websites. And I got resistance from my boss, demanding status updates ("are we there yet?")
The happy outcome is that that work became the kernel of a proper CI pipeline, and provoked a fairly deep change in the way the company worked. And by the end, I knew all about every site the company hosted.
We were just a web-shop; most web-shops are (or were) like this. If I was doing routine sysadmin, instead of coding websites, I was watched like a hawk to make sure I wasn't doing anything 'unnecessary'.
This incident gave me the authority to do the sysadmin job properly; and in fact it saved me a lot of sysadmin time - because previously, if a dev wanted a new version of a site deployed, I had to interrupt whatever I was doing to deploy it. With the CI pipeline, provided the site had passed some testing and review stage, it could be deployed to production by the dev himself.
It would have been cool to be able to do recovery drills, rotating roles and so on; but it was enough for my bosses that more than one person knew how to rebuild the server from scratch, and that it could be done in 30 minutes.
Life in a small web-shop could get exciting, occasionally.
It sounds like you're working in a different environment than the author. The environment they describe involves an ops _team_ rather than an ops _individual_ (what you've described). If you had to work with a team to resolve the incident, and had to do so on a fairly regular cadence, processes like this would likely be more useful.