Looks very nice -- as a feature suggestion, a floating number that automatically...

alexsolo · on Aug 12, 2009

Hmm... what do you mean by "floating number"? What we do right now is route all alarms to the engineer currently on-call. Each person can set up their own notification sequence so they get alerted using any combination of phone calls, SMSes, and emails.

antonovka · on Aug 12, 2009

A floating phone number that can be handed to, say, 24-hour support staff, that will automatically direct incoming calls to the on-duty engineer.

The value is that when support staff has real phone numbers available to them, they tend to dial historically responsive individuals directly in order to get a problem resolved, thus creating a negative feedback loop -- if you ignore notifications and phone calls, you get called less by support staff in the future.

Having a floating number -- especially if we could get statistics on who answers and who always ignores calls, and if the number could "call up the chain" automatically when nobody answers -- would be a useful tool to solve this issue.

Of course, the preference is that human staff doesn't need to call anyone, but it still happens.
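
To make the "call up the chain" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not anything PagerDuty is known to implement: `route_call`, `try_call`, and the engineer names are all hypothetical, and `try_call` stands in for whatever telephony layer would actually ring a phone and report whether it was picked up.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def route_call(
    chain: Sequence[str],
    try_call: Callable[[str], bool],
    stats: Counter,
) -> Optional[str]:
    """Ring each engineer in escalation order until one answers.

    `chain` is the escalation order (on-duty engineer first).
    `try_call` is a hypothetical hook that returns True if the person answered.
    `stats` accumulates (engineer, "answered" | "missed") counts across calls.
    Returns the name of whoever answered, or None if nobody did.
    """
    for engineer in chain:
        if try_call(engineer):
            stats[(engineer, "answered")] += 1
            return engineer
        # No answer: record the miss and move up the chain.
        stats[(engineer, "missed")] += 1
    return None


if __name__ == "__main__":
    stats: Counter = Counter()
    # Pretend only the team lead picks up the phone tonight.
    available = {"lead"}
    answered_by = route_call(
        ["on_call", "backup", "lead"],
        lambda name: name in available,
        stats,
    )
    print(answered_by)
    print(dict(stats))
```

Support staff would only ever see the one floating number; the per-person counts in `stats` are what would show who answers and who habitually ignores calls.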

agmiklas · on Aug 12, 2009

Hi, this is Andrew, one of the co-founders of PagerDuty.

That's an interesting idea. We've actually thought a bit about adding phone-based triggering to PagerDuty (via a 1-800 number + access code). The idea was to make PagerDuty useful to non-IT businesses like plumbers that also have the concept of out-of-hours on-call duty. From your comment, though, it sounds like this kind of feature would be pretty useful even in the IT world.

brianobush · on Aug 12, 2009

If my code crashes in the night, save the core files, logs, etc., and send me an email; unless, of course, the system is not restartable, in which case you have bigger issues. Calling an engineer in the night is going to prompt what? Sane code changes? QA'd code?

antonovka · on Aug 12, 2009

It forces engineering to internalize the cost of their errors, rather than externalizing those costs by pushing them to our customers and human support/operations staff.

This encourages the following:

- Closer correlation between business robustness requirements and software implementation.

- Adoption of more robust design or methodologies when required by the business. If all that is required is an automated restart, then automatically restart the software. If the same class of bugs causes regular failure, adopt a strategy for avoiding that class of bugs.

- A realistic platform for negotiating external support. If engineering staff is unable to produce software robust enough to support the business requirements, then we must make a business decision as to whether maintaining additional operations staff is cheaper -- in the short and long term -- than correcting the engineering issues.

In my experience, software that fails regularly enough to cost significant engineering resources in responding to those failures is generally broken software. The goal is to not let software get to that point, and to correct it quickly if it does.

It's very easy for your Operations budget to unnecessarily balloon under the load of supporting failure-prone software; engineering has every incentive to externalize the costs of their implementation decisions, while operations has every incentive to increase their headcount and budget by supporting those failure-prone systems.