
tl;dr

1. On February 29, 2012, new certificates were created with a one-year expiration date by adding 1 to the year. Since February 29, 2013 is an invalid date, VMs wouldn't start.

2. After multiple attempts to restart failed VMs, physical hosts were marked as failed and their VMs were migrated to other physical machines, so the problem propagated.

3. Management services were disabled to prevent customers from starting more VMs, compounding the problems.

4. After the leap-day bug was fixed, secondary failures were caused by mixing up incompatible versions of a networking plugin, so VMs had no network access.

5. Total duration of outages: about 16 hours.

6. 33% of a month's service to be credited to all customers, regardless of who was affected.
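
To make point 1 concrete, here is a minimal Python sketch of the "add 1 to the year" pattern and one possible guard. It is not the actual Azure code; the function names are made up for illustration, and falling back to February 28 (rather than March 1) is just an assumed policy.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Mirrors the bug described in point 1: bump the year, keep month and day.
    # Raises ValueError for February 29 when the next year is not a leap year.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Hypothetical guard: if the target year has no February 29,
    # fall back to February 28 (an assumed policy, not the vendor's).
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
    except ValueError as exc:
        print("naive version failed:", exc)
    print("guarded version:", safe_one_year_later(leap_day))
```

The naive version works on every other day of the year, which is why this kind of bug tends to surface only once every four years.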



Why is it that they think a single customer would be happy with 33% of a fee which is likely to be only a very small part of what their downtime cost them?

Not to mention that a 16-hour time to fix is insane, unless all your datacenters had been blown up or war had broken out.


> Why is it that they think a single customer would be happy with 33% of a fee

Because most other providers would have refunded the customer 16 / (24*28) = 1 / 42 = 2.4% of the bill.

Microsoft paid out more than 10x that amount.
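
For anyone who wants to check the arithmetic, a quick Python sketch follows. The function name is made up for illustration, and "pro-rata" here just means outage hours divided by hours in the billing month, which is an assumption about how such a refund would be computed.

```python
def pro_rata_credit_fraction(outage_hours: float, days_in_month: int) -> float:
    # Fraction of the monthly bill covered by a strictly pro-rata refund.
    return outage_hours / (24 * days_in_month)


# The figure used above: 16 hours out of a 28-day month.
pro_rata_28 = pro_rata_credit_fraction(16, 28)  # 16 / 672 = 1 / 42

# February 2012 actually had 29 days, which lowers the fraction slightly.
pro_rata_29 = pro_rata_credit_fraction(16, 29)  # 16 / 696

# How many times larger a 33% credit is than the pro-rata amount.
multiple_28 = 0.33 / pro_rata_28
multiple_29 = 0.33 / pro_rata_29

if __name__ == "__main__":
    print(f"28-day month: {pro_rata_28:.2%} pro-rata, 33% is {multiple_28:.1f}x that")
    print(f"29-day month: {pro_rata_29:.2%} pro-rata, 33% is {multiple_29:.1f}x that")
```

Either way the 33% credit comes out at roughly 14 times a strictly pro-rata refund.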

The type of SLA that you are talking about (one that pays out to cover all loss of business) does not exist anywhere, and if it did, it would cost you far more than the $10-$100/month hosting account that you'd normally buy.


Not exactly. For big outages like this, both Google and Amazon provide bigger refunds.

When Amazon's Elastic Block Store (EBS) was down, they credited back 10 days of service.

http://aws.amazon.com/message/65648/

And Google offers a 99.95% SLA for App Engine which refunds 10%, 25% or 50% of the total monthly bill if uptime falls below 99.95%, 99.00% or 95.00% respectively.

http://code.google.com/appengine/sla.html
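
As a rough illustration of how a tiered credit like that would apply to a 16-hour outage, here is a small Python sketch. The tiers are taken from the numbers quoted above, not checked against the SLA text itself; the function names and the treatment of exact boundary values are assumptions.

```python
def uptime_percent(outage_hours: float, days_in_month: int) -> float:
    # Monthly uptime as a percentage, given total downtime in hours.
    total_hours = 24 * days_in_month
    return 100.0 * (1 - outage_hours / total_hours)


def tiered_credit_percent(uptime: float) -> int:
    # Tiers as quoted above: below 95.00% -> 50, below 99.00% -> 25,
    # below 99.95% -> 10, otherwise no credit.
    if uptime < 95.00:
        return 50
    if uptime < 99.00:
        return 25
    if uptime < 99.95:
        return 10
    return 0


if __name__ == "__main__":
    uptime = uptime_percent(16, 29)  # 16 hours down in a 29-day month
    print(f"uptime {uptime:.2f}% -> {tiered_credit_percent(uptime)}% credit")
```

Under these assumed tiers, 16 hours of downtime in a month leaves uptime at about 97.7%, which would land in the 25% bracket.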



