> If people are trying to assign blame to a bug or outage, it's time to move on.
This is one of my favorite excerpts. I once worked in a lab where we had frequent catastrophic failures because there was never any disaster planning or contingency management. I personally triaged 3 such incidents, alone or with whoever happened to be there when the problem arose, and afterward tried to disseminate some suggestions for how to prevent similar problems in the future. No one was interested. People were primarily interested in tearing my head off because I hadn't handled the problem the way they would have done it (of course, they were out drinking beers or sleeping while I was dealing with the issue at 12 AM or on a weekend).
After the third time I said fuck it, the next time there is an issue I am going to ensure my own projects are safe and then I'm going home and turning my phone off. Let someone else deal with it. That is not the culture you want to be promoting.
I was on call as a new developer on a system. I was not given any procedures or troubleshooting documents. I got a call at 1 AM, missed it, and waited a minute to see if there was a message. There was a voicemail, so I started listening to it and logging on. Before I could even get halfway through, the person called again (why not leave the voicemail on the final attempt?). So I'm looking for the issue/fix for 5 minutes and they tell me they know who the SME is for that functionality, so they'll call them. Why even call me if you're just going to call the SME without giving me time to look at it? I got negative feedback from my manager about the way I handled it. So I asked how I should have handled it without any training or documentation. They said I should have called the SME. Well, I didn't know who the SME was, there's no documentation or list of who the SME is for which part of the system, and I was never instructed to immediately call the SME. Again, why not just call the SME first if they knew who it was? And the SME never created documentation because they are "too busy".
The hiring process for the company wasn't special. Of course, half the stuff they claimed in the interview changed later (I was hired as a Java dev but was assigned to Filenet; they said they don't outsource or do layoffs but have since started doing both).
This was an internal transfer. There were definitely warning signs in that interview. I was desperate because they were outsourcing my job in an obscure tech (Filenet) and we were expecting a kid.
The hiring manager said something to the effect of, "I was surprised anyone internal even applied to this job".
'Warning flag' doesn't do this justice. I have no idea what to call it, but desperation required I ignore it.
What do you mean exactly? There are tons of problems with the company. Stay long enough at any large company and I'm sure there are plenty. The issues can change dramatically from department to department.
The lack of documentation/procedure, and the process issues with others contacting you needlessly instead of the SME. They just seem like structural issues that would not be specific to one team.
Basically. Except there were 3 other tech leads in that area. They didn't know that specific piece of functionality, but they could have been given the new work to take stuff off that team's plate and make time for documentation. The leadership in that area didn't really care about anything other than delivering fast. Testing? Eh... Security issues? They're not that big of a deal - do them on an above-and-beyond basis (contrary to enterprise policy). On-call documentation? Not even going to try to create it. I mean really, all you have to do is create a knowledge document out of the SNOW incident ticket. Then the next time it happens there will be a link to the steps taken. But no.
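Seriously, it's barely even work. A rough sketch of the kind of thing I mean, using the ServiceNow Table API (the instance name, credentials, and exact field names here are made up for illustration):

    import requests

    INSTANCE = "https://example.service-now.com"   # made-up instance
    AUTH = ("oncall.bot", "not-a-real-password")    # made-up credentials

    def incident_to_kb(incident_sys_id):
        # Pull the incident record, including whatever notes the on-call left.
        inc = requests.get(
            f"{INSTANCE}/api/now/table/incident/{incident_sys_id}",
            auth=AUTH,
            headers={"Accept": "application/json"},
        ).json()["result"]

        # Turn it into a draft knowledge article so the next on-call
        # at least has something to find.
        article = {
            "short_description": f"On-call runbook: {inc['short_description']}",
            "text": inc.get("close_notes") or inc.get("work_notes") or "",
        }
        resp = requests.post(
            f"{INSTANCE}/api/now/table/kb_knowledge",
            auth=AUTH,
            json=article,
            headers={"Accept": "application/json"},
        )
        resp.raise_for_status()
        return resp.json()["result"]["sys_id"]

Run something like that when the incident closes and the next person paged at 1 AM has a link to the steps taken. But no.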
Eh, that's a nice thing to say, but it only makes sense at certain scales, and no matter what, there's always a person that can break it.
If any random person can break it, it's already broken.
If any employee can break it, it's probably broken (there are very small scales where even this doesn't apply. Ever worked for a company with less than ten people? There's probably something any employee can break).
If any employee that's an engineer, sysadmin or developer can break it, well now you're at least reducing the problem to a more specific set of people.
If only the people on a specific team responsible for a system can affect the system, now you've reached a fairly good point, where there's separation of concerns and you're mitigating the potential problems.
If only a single person can break the system, you've gone too far. That effectively means only a single person can fix or work on the system too, and congratulations, you've engineered yourself into a bus factor of one. Turn right around and go back to making sure a team can work on it.
Finally, realize that sometimes the thing only one team can break is an underlying dependency for many other things, and they may inadvertently affect those. You can't really engineer yourself out of that problem without making every team and service a silo from top to bottom. Got shared VM infrastructure, whether in-house or in the cloud? The people who administer that can cause you problems. Don't ever believe they can't. Your office IT personnel? Yep, they can cause you problems too.
Some problems you fix by making it so they can't happen. Other problems you fix by making it hard to happen and putting provisions in place that mitigate the problems if they do.
There are lots of places where we require that no single person can break the system at least in a certain way.
For example, code review and LGTM ensure that a single individual can't just break the system by pushing bad code.
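To be concrete, the rule that kind of gate enforces is tiny; a toy sketch (in practice it's a branch-protection setting or a CI check rather than hand-rolled code, and the names here are made up):

    # Toy sketch of the LGTM rule: a change merges only if at least one
    # person other than its author has approved it.
    def can_merge(author: str, approvals: set[str]) -> bool:
        return bool(approvals - {author})

    assert can_merge("alice", {"bob"})        # reviewed by someone else: ok
    assert not can_merge("alice", {"alice"})  # self-approval doesn't count
    assert not can_merge("alice", set())      # no review at all: blocked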
Often there are other control planes that don't have the same requirement, but I think the idea that there must always be one person who can break the system isn't clearly true.
I'm making an (admittedly subtle) distinction here between complex mistakes, where something was missed, and simple mistakes/bad actors where someone used a privilege in a manner they shouldn't have.
LGTM ensures that, for example, a single individual can't push a code change that drops the database. On the other hand, that same individual might be able to turn off the database in the AWS console.
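You can gate that path too, of course; for instance, an IAM policy that denies the destructive RDS calls unless MFA was used. A rough sketch with boto3 (the policy name and the exact action list are just for illustration):

    import json
    import boto3

    # Rough sketch: deny the obviously destructive RDS calls unless MFA was
    # used, so a lone console session can't just turn the database off.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": ["rds:DeleteDBInstance", "rds:StopDBInstance"],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"},
            },
        }],
    }

    iam = boto3.client("iam")
    iam.create_policy(
        PolicyName="deny-destructive-rds-without-mfa",
        PolicyDocument=json.dumps(policy_document),
    )

The point is just that each control plane needs its own version of that gate; it doesn't come along for free with code review.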
> LGTM ensures that, for example, a single individual can't push a code change that drops the database.
Personally, I've seen LGTM let slip complex bugs in accounting code (admittedly, not great code) that went on to irreversibly corrupt hundreds of millions of accounting records.
Yes, it will catch "DROP DATABASE", but when it's still letting through major bugs that similarly require a full restore from backup... It seems functionally equivalent?
Given:
> There are lots of places where we require that no single person can break the system at least in a certain way.
I don't think code reviews are a solution. I mean, they're one of the better solutions I can think of, but they're not actually a solution.
> For example, code review and LGTM ensure that a single individual can't just break the system by pushing bad code.
There's always someone with rights to push code manually, or to tell the system to push an older version of the code which won't work anymore. Someone needs to install and administer the system that pushes the code, and even if they don't have direct access to push the code to where it eventually goes, someone's access credentials (or someone who controls the system that holds those credentials) grant access somewhere along the way.
But who controls whether the code system is even up and available to allow check-ins? Can one person break that? What about whoever controls the power state of the systems the code gets checked in on? Is that also ensured not to be a single person? What about office access? What about the power main at your building? Is it really impossible for one person to cause problems there?
It might sound like I'm moving the goalposts, but that's sort of my point: these are all dependencies on each other. It's impossible to actually make it so one person can't cause any problems, because you can't eliminate all dependencies, and you can't even accurately know what they all are. What you can do is focus on the likely ones, put in place whatever safeguards are sane, and then take all the crazy effort you would have spent chasing the diminishing returns of trying to make failure impossible and spend that time and effort on making recovery quick and easy instead.
Unfortunately, some of the work that goes into making sure no one person can cause a problem might actually make that harder. Requiring someone to sign off on a commit before it goes live is great at 2 PM Tuesday, but not so great when it's required to fix something at 2 AM Sunday. This is the tightrope that needs to be walked, and it's also why, even if you don't necessarily know about it, there probably is someone who has access to break something all by themselves: they're the one who gets called in to make sure things can be fixed when the shit hits the fan and all those roadblocks meant to prevent problems have to be bypassed so the current problem can actually be fixed.
Any system that doesn't have some people like that at various levels persists in that state only until there's a problem and, in the incident assessment, someone has to answer why a 5-minute fix took hours, with an answer that includes a lot of "we needed umpteen different people and only a fraction of them were available immediately".
Even at Google (which I see you work at from your profile), my guess is that people in the SRE stack can cause a very bad day for most apps. My guess is that even if the party line is that no one person can screw anything up, you probably don't have to ask too many SREs there before someone notes that it's more of an aspiration than a reality.
Sorry if that's a bit rambly. I know you weren't specifically countering what I was saying. I've just had a lot of years of sysadmin experience where it's pretty easy to see the gaps in a lot of these solutions whose public face looks pretty secure.
What systems are you working on? Many are held together by ritual, and deviating from the ritual causes outages. They’re very fragile in some form (deployment, change, infrastructure, dependencies, etc.). They won’t break if you follow the happy path, but to say they’re so robust that an active attempt at breaking won’t bring them down is ... naive? Not sure if that’s the word I’m looking for.
I say this as someone who’s worked at large tech companies that are “internet scale”.
Maybe. I've seen the opposite, where no one takes responsibility for anything, and it's also bad. In fact, the situation you describe could also be a lack of anyone else taking responsibility for disaster planning and the like.
I think what is needed is a culture of -ownership-. That's basically people saying "I'm responsible". Not one where everyone tries to avoid responsibility, and not one where people point fingers.
Why does someone need to take responsibility when you can have a culture of blameless postmortems where everyone focuses on making sure whatever happened never happens again instead? In a blameless postmortem culture, everyone is responsible by default.
"Everyone focuses" = nothing gets done. I've been at places like that, where a post-mortem happens, a course of action is decided on...and then no one owns actually carrying out that course of action.
You could argue that "It should be assigned" - yeah, it should. But assigning it implies either "here is the team that is responsible for it", i.e., this is the team responsible and they need to be told to fix their shit (which very much sounds like blame), OR it implies "here is the team that I am entrusting to fix it DESPITE their obviously not being responsible for it", which is just as bad, since it implies that the team that 'is' responsible for it is incompetent.
The only healthy option is that the 'responsible' team stands up to say "hey, that's ours; we'll fix it", and the only way they'll do that is if you have a culture of safety and ownership.
Also, one thing to make clear - ownership = responsibility = blame. They're all words for the same thing, just with different implications. You can't have someone 'own' something without making them responsible, and apt to be blamed if you don't ensure the culture is one that does not attach blame. That's really what I was getting at; of course you shouldn't blame. But you also can't avoid ownership. And ownership implies you know WHO to blame, so blame comes very easily. It's also very easy to mistake pointing out responsibility/ownership for blame; I've had multiple managers tell me "it's not us vs. them" when I've raised the fact that I'm unable to deliver to deadlines because I've been unable to get anything from product.
The people most capable of taking on the action items are assigned them. That could be a matter of expertise, resourcing, proximity, etc.
In an open discussion of the root cause, many times the issue is across multiple services / organizations within a company. You’d assign tasks appropriately across teams as needed. The key is to find and create actionables to address the root cause, not to punish / blame individuals.
"The people most capable of taking the action items are assigned it. This could be expertise, resourcing, proximity, etc."
Expertise and proximity are facets of responsibility (well, technically they are facets of knowledge, but ideally knowledge, empowerment, and responsibility are aligned, else things ALSO won't get done). Resourcing is a red herring; I've seen things get assigned to teams based on "they have the capacity", without it being an area whose domain they're familiar with (i.e., they don't work in that area, and ergo are not responsible for the outcome) - those things rarely get done, and never get done well.
The blameless postmortem is a "legal fiction": it doesn't really mean that blame cannot be assigned, just that blame cannot result in punishment or loss of face/standing.
At the end of the day you are going to have someone stand up and say: yep, we should have planned for this, and we will correct it in x, y, z ways.
What does it mean to be responsible? Just to say it? Responsibility should be accompanied by fines corresponding to the damage, or something like it. Otherwise those are just words. I'm responsible, but I'm not getting fined if something goes wrong, so whatever, I'm responsible. Fire me if you want, I'll find new work in a few days, but I was responsible.
It's the business owner who's responsible, because ultimately he's the one bearing all the costs when a critical event happens, a client leaves, a client sues the company, and so on. Other people are not really responsible, they just pretend to be.
So I've actually written about this in the past, but responsibility is the outcome -actually affecting the entity-.
That is, "you're responsible for this" - if they do it, and it succeeds, what happens? If they don't do it, and it fails, what happens? If the answer is "nothing" in either of those cases, they're not actually responsible. If the result is too detached, they're also not actually responsible (i.e., if I decide not to do one of the ten tasks assigned to me, and I don't hear about it until review time, if at all, then I was never responsible).
Responsibility is innately tied to knowledge and empowerment, but without going on at length, and just to give an example - if I'm the one woken up by the PagerDuty alarm when something breaks, I am responsible for that something, because its success or failure directly affects me. If, however, there is a separate ops team that has to deal with it, and I can slumber peacefully, responsibility has been diluted; you won't get as good a result.
Honestly, I don't know; I unknowingly followed the author's advice. About half a year after the last incident, a friend who I went to school with called me up and offered me a job at his fledgling biotech. I accepted and never looked back.