
This is what I call "fool's availability": reducing single points of failure (one cloud provider) without adding any actual redundancy.

If you removed AWS/GCP/Azure/etc and just had 100 small providers scattered all over, the result would be hundreds of outages throughout the year, as opposed to one big outage every other year [in one region]. AWS is already way more reliable than any other provider.

The real problem here is that companies that use AWS are morons who don't know how to architect/build infrastructure properly.

If it's important, it should be built right, regardless of who the provider is. A software building code would mandate how companies could use infrastructure (AWS or any provider) so that important services would not go down when one service or region goes down.

This is the basic concept behind things like the electrical code. It doesn't matter how great a public utility is; if your business is wired up so badly that a stiff breeze sets it on fire, just switching utilities isn't gonna help. And some utilities do occasionally have problems that persist down their lines to the customers, so customers need to set up equipment to protect against those failures. Whole-house surge protectors, lightning arresters, EMP shields, etc are necessary so that a rare event doesn't fry expensive customer equipment.



It's probably worse: a given stack using several of these small providers will probably have more “single points of failure” (providers used in series rather than in parallel).

(If most companies liked using cloud providers in parallel, they’d already be doing it today between AWS, Azure, and GCP.)
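To put rough numbers on the series-vs-parallel point, here is a minimal sketch. The 99.9% per-provider availability is a made-up figure for illustration, and the math assumes providers fail independently, which real ones often don't.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365

def series_availability(avails):
    """Every provider must be up for the stack to work."""
    return prod(avails)

def parallel_availability(avails):
    """The stack works as long as at least one provider is up."""
    return 1 - prod(1 - a for a in avails)

def downtime_hours(avail):
    """Expected hours of downtime per year at a given availability."""
    return (1 - avail) * HOURS_PER_YEAR

PER_PROVIDER = 0.999  # assumed for illustration, not a measured figure

single = PER_PROVIDER
in_series = series_availability([PER_PROVIDER] * 5)      # 5 providers chained
in_parallel = parallel_availability([PER_PROVIDER] * 2)  # 2 redundant providers

for label, a in [("single", single),
                 ("5 in series", in_series),
                 ("2 in parallel", in_parallel)]:
    print(f"{label:>14}: {a:.6f} available, ~{downtime_hours(a):.2f} h/year down")
```

Under those assumptions, chaining five providers takes you from about 9 hours of expected downtime a year to about 44, while two in parallel cuts it to well under a minute. The parallel number is the optimistic one, since it ignores correlated failures and the failover machinery itself.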


Yes, but most of those companies aren't morons, they're just taking an acceptable risk. A multi-region or multi-cloud setup is nontrivial.


Most companies I've worked for (and have heard about from others) have either lacked the knowledge, or the will, to evaluate risk. They build things until they "just work", and their thought process ends there. They don't examine the design to identify its reliability and security risks. They don't calculate the losses. They still have issues, but they just happen to be acceptable most of the time.

Example 1: A company's infra goes down, but it doesn't come back up correctly. People run around trying to get it working again. It takes much longer than they hoped/expected, and they lose a lot more money than they expected. This is because they never really understood the risk they were exposed to. If they had understood it, they would have done more ahead of time to mitigate that much risk.

(Today's outage is this case. A lot of companies are going to lose money after today, because their customers are not happy with these "acceptable risks". Presumably, losing this much money due to one outage will not be an acceptable risk in hindsight. So the company either didn't understand its risk, or it did but was too stupid to prevent it.)

Example 2: A company gets hacked, and its data is either exposed or wiped. This is a much worse result; they can lose tons of money, chase off customers, damage their brand, open themselves up to lawsuits and fines, even tank the whole company. It's clear that this risk is pretty unacceptable. But it keeps happening. And the reason usually isn't "some genius hacker"; it's a failure to understand the risk of not investing in security.

(There are tons of examples of these in the news. Presumably, not investing in security was not an acceptable risk in hindsight when it ended their business! Almost always, the people involved in making these products don't know enough about security to understand the risks. But they also don't invest in security training, mandatory security controls, checklists, processes, quality gates, etc.)

You don't need multi-region or multi-cloud to mitigate reliability risks. Just like you don't need to hire a big security team or invest tons of cash to mitigate security risks. You can use your existing infra and tools, and mitigate both issues. You just have to use them wisely. It takes some effort and time, but you do it once and it pays dividends indefinitely.

Building something without identifying its security/reliability risks, and then not calculating those risks' impact, is not acceptable risk; it's ignored risk. Is tanking your company and shedding customers an acceptable risk? Well, there's one way to find out.
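"Calculating those risks' impact" doesn't have to be fancy. A minimal sketch of the back-of-the-envelope version: expected annual loss is frequency times cost, and a mitigation is worth it if the loss it removes exceeds what it costs. The `Risk` class and every number below are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    events_per_year: float   # assumed frequency
    loss_per_event: float    # assumed cost per event, in dollars
    mitigation_cost: float   # assumed yearly cost of the mitigation
    reduction: float         # assumed fraction of the loss the mitigation removes

    def expected_annual_loss(self) -> float:
        return self.events_per_year * self.loss_per_event

    def mitigation_net_benefit(self) -> float:
        """Positive means the mitigation pays for itself on average."""
        return self.expected_annual_loss() * self.reduction - self.mitigation_cost

# Made-up numbers, purely for illustration.
risks = [
    Risk("regional control-plane outage", 0.5, 200_000, 20_000, 0.8),
    Risk("data breach", 0.05, 5_000_000, 50_000, 0.6),
]

for r in risks:
    print(f"{r.name}: expected loss ${r.expected_annual_loss():,.0f}/year, "
          f"mitigation net benefit ${r.mitigation_net_benefit():,.0f}/year")
```

The inputs will be guesses, but writing the guesses down is the difference between an accepted risk and an ignored one.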


To be immune to this outage, you would've needed cross-region failover. We'll see if customers switch to whichever company was resilient, but this has happened before and the answer was no.


The company I work for has a ton of stuff in us-east-1, many large products and sites, and we didn't go down. Our products/services aren't multi-region or multi-cloud. We don't pay exorbitant bills or have super complicated architectures.


If you were using AWS services that went down in us-east-1, how did you avoid an outage without failing over to anything outside that region?


That's the thing - most AWS services didn't "go down", as in stop working entirely. There were specific operations of specific services that were failing. Increased API error rates, inability to start new EC2 instances, billing metrics unavailable, AWS console unavailable, etc.

The outage wasn't like "all our servers stopped running". It was dynamic, new, specific operations that failed. If you just had a Fargate container that was started a week ago, and you have no need to restart the container today, it just kept chugging along.

Our architecture is stuff that just keeps chugging along. Fargate, S3, RDS, CloudFront, Cloudflare, etc. From our perspective, there was no outage in us-east-1. Literally the only alert we got the entire time was "billing limit exceeded" - and that was a false alarm, because it was set to alarm if there is zero billing data.
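That false alarm is avoidable if the check treats "no data" as its own state instead of as a breach. A minimal sketch, assuming a hypothetical evaluator (`evaluate_billing_alarm` and `AlarmState` are illustrative names, not any AWS API):

```python
from enum import Enum
from typing import Optional, Sequence

class AlarmState(Enum):
    OK = "ok"
    ALARM = "alarm"
    NO_DATA = "no_data"

def evaluate_billing_alarm(datapoints: Sequence[Optional[float]],
                           limit: float) -> AlarmState:
    """Classify a window of billing datapoints; None marks a missing sample."""
    present = [d for d in datapoints if d is not None]
    if not present:
        # The metric pipeline is down; this says nothing about actual spend.
        return AlarmState.NO_DATA
    if max(present) > limit:
        return AlarmState.ALARM
    return AlarmState.OK

# Hypothetical windows: normal spend, overspend, and metrics unavailable.
print(evaluate_billing_alarm([120.0, 130.0], limit=500.0))
print(evaluate_billing_alarm([120.0, 650.0], limit=500.0))
print(evaluate_billing_alarm([None, None], limit=500.0))
```

Whether NO_DATA should page someone is a separate decision; the point is only that it shouldn't be reported as "billing limit exceeded".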


But is this strategy or luck? I'm not seeing how those many companies did something dumb or wrong here while you did it right. Like, are they only affected because they overcomplicated their deployments? Either way, it sounds like your service isn't resilient against a generalized regional outage.



