Writing Runbook Documentation When You’re an SRE

williamDafoe · on Feb 1, 2020

This is not a very good article. Here are some improvements.

1. Runbooks go out of date faster than anything. Therefore it is absolutely crucial to have the whole book on a single page and alerts vector to the proper subsection. Also pepper each section with a good set of keywords. This will allow users and newbies to easily search for procedures or related alerts when links invariably break.

2. Group related alerts ("host xyz down") in a single section with many section titles. One for each possible xyz.

3. Go ahead and put commandline commands in the runbook, in shell-command or code highlighted boxes (NEVER inside sentences). User defined fields should be delineated with $HOST etc (never <host>) and sample values for the vars given beforehand with sample output afterward. Never use a copyable $ to delineate shell commands. This creates the best possible user experience for people reusing your commands by cut and paste so they can more easily change values and so they know what to expect.

4. Link all relevant consoles in very small links (CPU usage in clusters: ym qf ij) including historical links to console views for past problems and how they look.

5. Section templates might look like this:

1.8. Too many cacheservers are down

1.8.1 Definition. This means that 25% of the hosts (on average) have been down over the past 10 minutes

1.8.2 Severity. Our load balancer will route all requests to surviving hosts and clients will retry on timeout so normally this is not severe (there is only a performance impact). However it could cascade due to RAM exhaustion or a query-of-death or due to a config push of broken software so assess the service right away for these problems.

1.8.2 Remediation ... Rollback or resource scaling or bypassing the cache service on the command line ...

- Google search SRE

mooreds · on Feb 1, 2020

Agree 100% on clearly delineating the commands. Do you have any suggestions for keeping commands up to date?

I don't understand your comment #4: "CPU usage in clusters: ym qf ij"

Thanks for the great suggestions for improvement (disclosure, I work with the author and posted the link).

joshuamorton · on Feb 1, 2020

They're being google-jargon-y.

Replace "ym qf ij" with "us-east-1 us-west-1 eu-2" and each links to the monitoring console for aggregate cpu usage for the job in that region.

phrotoma · on Feb 1, 2020

The value of $HOST vs <host> isn't jumping out at me. Can you elaborate?

LukeShu · on Feb 1, 2020

It's about copy/paste-ability. If you're copy/pasting in to a terminal, with $HOST you can type

    HOST=myhost.tld

first, then just paste, and everything will work.

With <host>, if you paste the command without editing it, then it will do something different and write some files to disk (unless quoted).

kureikain · on Feb 2, 2020

I have been using this at work internally.

Basically I write an app that parse a directory of markdown file. However, the code block is runnable.

The code block is organize like this:

``` input: - name: description: - othervar: description: --- real shell code here ```

It adds a run button under the code block, when run, it parse the inputs to generate UI for input these parameter. Then it spin up a pod in kubernetes and run the script. We are already using Vault, so the script can access vault to get the secret it wants.

It feels awesome because we put the link to runbook in pagerduty alert. It keeps the document and code in-sync.

However, it's a kind of like a backdoor :-(. It essentially give shell access to entire infrastructure(Since it has Vault access). I tried hard to protect this tool but still feel uneasy about it. But without it I don't know how hard it is to keep code and document in-sync.

epicgiga · on Feb 1, 2020

Runbooks are more of an anti-pattern than anything.

A list of commands you should run to accomplish X -> that's a script. Or it should be. If accomplishing a task takes more than a single step, it means there's insufficient automation or programmability, and that's the problem you should fix, not teaching devs how to be better writers of prose.

Code can be tested quickly and relatively easily. Docs can't.

Docs should be as minimal as possible: "# Server ## Rebooting - ./scripts/server/reboot.sh".

Any docs written by a dev should be run by an editor, who should strip them down to the bare minimum. Your typical readme is inundated with waffle, and the actual meat in it is wrong well over half the time.

Automate everything.

ggambetta · on Feb 1, 2020

> Runbooks are more of an anti-pattern than anything.

No. They're essential to the sanity of the individual SRE oncall, and of the SRE team they are part of. It's also how institutional knowledge is preserved after successive rounds of oncall shifts.

If you think a playbook is bad, try being oncall for the first time for a massively (Google-scale) distributed system without a playbook.

> A list of commands you should run to accomplish X -> that's a script.

A list of possible root causes an SRE should consider, complex interactions with other systems that might be problematic, not overreacting to spurious alerts -> that's a playbook.

0x445442 · on Feb 1, 2020

It seems the problems your describing are management problems not engineering problems. The people who designed, built and are responsible for maintaining and updating the system should be on call if something goes wrong.

I view runbooks in the same way I view "knowledge transfers". It's management's desperate hope that somehow a person leaving a project or company can convey their acquired knowledge of a system on some Confluence page or from a few meetings. It's a complete failure to recognizing the essence of the work.

I don't think runbooks are necessarily bad if they are used to bootstrap new team members. But thinking their existence is a green light to allow people without domain expertise in a system to be responsible for administering the system during "off hours" is misguided.

ggambetta · on Feb 1, 2020

> The people who designed, built and are responsible for maintaining and updating the system should be on call if something goes wrong.

In my experience, the devs were somewhere in the US west coast, and the SRE teams were geographically distributed to cover the 24 hour period during local daytime (nobody likes to be paged in the middle of the night). As an SRE in Zürich, I got paged in what was the middle of the night for the Kirkland people, dealt with the emergency (using the playbook), root-caused it (with the assistance of the playbook), and filed bugs to be looked at by the dev team when they woke up.

The systems stayed up, everyone could sleep at night, working as intended.

0x445442 · on Feb 1, 2020

> and the SRE teams were geographically distributed to cover the 24 hour period during local daytime

Management problem number 1. These people should not be responsible for the running system.

> nobody likes to be paged in the middle of the night

Excellent motivation for the people that should be responsible for the running system to build quality software.

SpicyLemonZest · on Feb 2, 2020

Why would randomly waking your engineers up in the middle of the night be an excellent motivation strategy?

iso1631 · on Feb 3, 2020

It's an incentive to not release stuff that breaks in the middle of the night.

On the flip side that can lead to slower releases, or more expensive solutions

notyourday · on Feb 1, 2020

> If you think a playbook is bad, try being oncall for the first time for a massively (Google-scale) distributed system without a playbook.

You are not Google scale. Don't invent a pen that writes in space when a pencil would do the trick.

marcinzm · on Feb 1, 2020

I'd argue 100% automation is a space pen while run books are a pencil.

notyourday · on Feb 1, 2020

You do not need 100% automation. What you need is a systematic approach to handling problems followed by fixing the root cause.

Runbooks came from techops in broadcasting, power plant operations, etc where there was a clear division between operators who pushed buttons, ran cables, etc and those that made decisions about buttons to push and cables to run. Dumb hands + runbooks created "smart hands".

If your SRE runs like that it is not SRE.

Look at the incident handling:

1. Identify the issue

2. Implement a workaround to restore the service

3. Identify the root cause

4. Implement a fix for the root cause

5. Remove the workaround

Runbooks cover 1. and 2.

ggambetta · on Feb 1, 2020

> You are not Google scale.

I don't know, I think I was kind of Google scale when working as a Google SRE.

notyourday · on Feb 1, 2020

1) https://www.transposit.com/ is not Google.

2) Your impact on Google scale was nearly 0 because you were one of a thousand SREs at Google.

williamDafoe · on Feb 1, 2020

There are only 500.

notyourday · on Feb 1, 2020

My bad

GaryNumanVevo · on Feb 1, 2020

There’s no tolerance for rude and unsubstantiated comments here

easymodex · on Feb 1, 2020

Did you just assume my scale?

p_l · on Feb 1, 2020

And check your metaphors before using them.

The pencil was more expensive.

marcinzm · on Feb 1, 2020

>If accomplishing a task takes more than a single step, it means there's insufficient automation or programmability, and that's the problem you should fix, not teaching devs how to be better writers of prose.

Perfect is the enemy of good.

In practice your automation will not be good enough due to either tooling issues or script issues or time to manage automation for rare use cases. Worse, automation can also become stale and if it's for rare events then it's likely that it won't work when needed. After all, we all know that documentation is always out of date, why do you think the script won't be be? Run books solve that by involving a human and explaining to a human how to think about the problem and allow them to work around misinformation.

mooreds · on Feb 1, 2020

> Automate everything.

Whenever I see someone say something like this, I'm always curious about where human judgement and/or investigation comes in.

Don't get me wrong, I'm happy to automate what can be automated if it makes sense (that is, the action can be easily turned into a script and the cost to create and maintain the automation is not higher than the value created) but aren't there plenty of areas where some level of human judgement is needed and you can't just automate everything?

Isn't that kinda what SREs do? Develop intuition for what can be automated, automate what they can, set up process for what they can't?

scaryclam · on Feb 1, 2020

The "Automate everything" mentality tends to come from developers who are not yet really experienced enough to know what value automation is bringing. I'm a developer and I love automation of things, but I also know that some things don't automate well (self-healing systems are great, but at this point in time they mostly fail in interesting ways, which is one of the reasons SREs exist).

Automation also doesn't mean a system is actually of value. I've seen plenty of automation that has made life harder and incidents more common. It's also meant that some systems have been designed in a sub-par way, just to shoehorn automation in.

The mantra "automate everything" should be a nice rule of thumb, but not followed with any dogma.

rtpg · on Feb 1, 2020

I'm basically of the same opinion as you, but recently read this article[0] on "do nothing" scripts and really like it. The idea is to write at your process _as a script_ that will at first just print what to do. Then you can use computer magic to improve things over time and automate steps in the middle.

I generally agree that there are a lot of processes that are like "use your brain and think at this step", which is hard to automate, really. Open-ended workflow tools don't really exist for this kind of thing.

(Clarifying example: what script do you write to "diagnose a response time spike"? You can definitely write up usual suspects but at one point you don't have a script that gets you from the beginning to end )

[0]: https://blog.danslimmon.com/2019/07/15/do-nothing-scripting-...

mooreds · on Feb 1, 2020

That's a great article. The HN discussion on it is worth reading too: https://news.ycombinator.com/item?id=20495739

epicgiga · on Feb 2, 2020

This is a great intermediate step.

epicgiga · on Feb 1, 2020

In the context of docs/"runbooks", my point is you're not saving anything by doing that instead of scripting. You're going to pay for it in the long run when the docs quickly drift out of date and then people stop trusting them, and the docs have limited value anyway because they're just going to ask for your help in person instead 90% of the time (because that's 10x more effective).

Scripts can have similar problems, but they're rapid and testable, so far easier to maintain their integrity.

Where judgment is needed, i.e. unique cases, the runbooks won't help much anyway, and its up to the tech lead to work out why that problem happened once they've fixed it, and then either modify the system, or add a new script to fix that particular new problem if it arises again.

mherdeg · on Feb 1, 2020

One of the advantages of having documentation live alongside code in source control is that, in principle, the person reviewing the pull request could edit the prose.

The pattern I see most often is "oh my goodness, thank you for making any effort at all to document this mess, this is way better, your pull request is approved."

btmiller · on Feb 1, 2020

Automate everything and verbose runbook are not mutually exclusive. As you automate, your alerting events trend toward more complex issues that require understanding complex systems.

In those cases, you don’t always know how much context the on-call engineer will have when receiving a page. That’s why I preach for verbosity and context setting in runbooks. They are extremely important for fast remediation of production-impacting issues.

andrewflnr · on Feb 1, 2020

You still need documentation on what scripts to run under what circumstances and how, for instance flags and configuration. That's what your "runbooks" or whatever should be. I'm currently in the process of replacing as many procedural docs with scripts as possible, but the docs aren't going to disappear, even if they get reduced to mostly a shell command or two with some context.

You're also making several assumptions like "everyone involved are devs" and "surely you don't have to manually plug and unplug wires, right?" Yes, my job is weird...

0xbadcafebee · on Feb 1, 2020

You can't really automate everything. Even if you could, and you had the time, and the money, and the omnipotence to know of every possible potential failure and Domino condition and write automation for it before it happened, it would still be a bad idea.

Computers can't replace the dynamic capability of humans to solve difficult problems quickly. But they can provide tools for humans to use to solve it quicker. Runbooks are just one of many tools humans use to this end, like scripts, like some automation, autonomation, escalation, incident management, etc. There will always be bugs no automation can fix, and so we rely on humans to fix it, and make it as easy for them to do that as possible.

Furthermore, code is not documentation, because code does not explain how or why the decisions were made to write the code that particular way. Minimalism is not great when what you may need is knowledge. Documentation is different from a runbook.

Documentation doesn't have to be tested, it just needs indicators of age and health. If it's >1yr old, it probably needs rewriting. A view count helps, as does a like button, this was helpful/unhelpful button, and comments section. If you link them, ones that change can indicate the others that may need changing.

acvny · on Feb 1, 2020

SRE is a scam

ryanklee · on Feb 1, 2020

Care to qualify?

0x445442 · on Feb 1, 2020

I agree as well, here's the start of a qualification:

https://news.ycombinator.com/item?id=22210952

andybak · on Feb 1, 2020

SRE = Site Reliability Engineering apparently...

Sigh...