Looks very nice -- as a feature suggestion, a floating number that automatically routes to the currently on-call staff would be a great addition.
As an aside, I find the best way to avoid regular failures and decrease the necessity for a large operations staff is to put the individuals responsible for building the system on-call for when it fails. Your operations staff is woken up when a server crashes or a hard drive fails, and your engineers get woken up when their code crashes in the middle of the night.
If you don't do this, the costs of writing poor production code have to be levied across departments by management, rather than avoiding externalities entirely and letting engineers and operations deal with the direct impact of their implementation choices.
Of course, this is ultimately a wash if you don't also institute development methodologies to help reduce the number of production-impacting bugs, rather than simply relying on engineers' reactive fixes to one-off issues.
Hmm... what do you mean by "floating number"? What we do right now is route all alarms to the engineer currently on-call. Each person can set up their own notification sequence so they get alerted using any combination of phone calls, SMSes, and emails.
A floating phone number that can be handed to, say, 24-hour support staff, that will automatically direct incoming calls to the on-duty engineer.
The value is that when support staff has real phone numbers available to them, they tend to dial historically responsive individuals directly in order to get a problem resolved, thus creating a negative feedback loop -- if you ignore notifications and phone calls, you get called less by support staff in the future.
Having a floating number would be a useful tool to solve this issue, especially if we could get statistics on who answers and who always ignores calls, and if the number could "call up the chain" automatically when nobody answers.
Of course, the preference is that human staff doesn't need to call anyone, but it still happens.
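To make the floating-number idea above concrete, here is a minimal sketch of the routing logic: try the on-duty engineer first, "call up the chain" when nobody answers, and record who answers and who doesn't. Everything in it is an assumption for illustration only: the `route_call` function, the `try_dial` callback standing in for a real telephony call, and the names in the chain are hypothetical and are not part of PagerDuty.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def route_call(chain: Sequence[str],
               try_dial: Callable[[str], bool],
               stats: Counter) -> Optional[str]:
    """Dial each person in escalation order; return whoever answers first.

    `try_dial` is a hypothetical stand-in for a telephony call that
    returns True if the person picked up.
    """
    for person in chain:
        if try_dial(person):
            stats[(person, "answered")] += 1
            return person
        # Nobody picked up: record the miss and escalate to the next person.
        stats[(person, "missed")] += 1
    return None  # the whole chain was exhausted without an answer


# Hypothetical escalation chain: on-duty engineer first, then up the chain.
chain = ["oncall_engineer", "team_lead", "engineering_manager"]
stats = Counter()

# Simulated dialing: in this example only the team lead picks up.
answered_by = route_call(chain, lambda person: person == "team_lead", stats)
print(answered_by)
print(dict(stats))
```

Since support staff only ever see the one number, the per-person answered/missed counts are what would expose the people who routinely ignore calls.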
Hi, this is Andrew, one of the co-founders of PagerDuty.
That's an interesting idea. We've actually thought a bit about adding phone-based triggering to PagerDuty (via a 1-800 number + access code). The idea was to make PagerDuty useful to non-IT businesses like plumbers that also have the concept of out-of-hours on-call duty. From your comment, though, it sounds like this kind of feature would be pretty useful even in the IT world.
If my code crashes in the night, save the core files, logs, etc. and send me an email; unless, of course, the system is not restartable, in which case you have bigger issues. Calling an engineer in the night is going to prompt what? Sane code changes? QA'd code?
It forces engineering to internalize the cost of their errors, rather than externalizing those costs by pushing them to our customers and human support/operations staff.
This encourages the following:
- Closer correlation between business robustness requirements and software implementation.
- Adoption of more robust design or methodologies when required by the business. If all that is required is an automated restart, then automatically restart the software. If the same class of bugs causes regular failure, adopt a strategy for avoiding that class of bugs.
- A realistic platform for negotiating external support. If engineering staff is unable to produce software robust enough to support the business requirements, then we must make a business decision as to whether maintaining additional operations staff is cheaper -- in the short and long term -- than correcting the engineering issues.
In my experience, software that fails regularly enough to cost significant engineering resources in responding to those failures is generally broken software. The goal is to not let software get to that point, and to correct it quickly if it does.
It's very easy for your Operations budget to unnecessarily balloon under the load of supporting failure-prone software; engineering has every incentive to externalize the costs of their implementation decisions, while operations has every incentive to increase their headcount and budget by supporting those failure-prone systems.