This is what I call "fool's availability": reducing single points of failure (on...

dudeinjapan · 2025-10-20T18:21:19 1760984479

It's probably worse: a stack that uses several of these small providers will probably have more “single points of failure” (providers used in series rather than in parallel).

(If most companies liked using cloud providers in parallel, they’d already be doing it today between AWS, Azure, and GCP.)
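The series-versus-parallel point can be made concrete with a little arithmetic. The sketch below assumes every provider fails independently and uses a made-up 99.9% availability figure; neither assumption comes from the thread, and real providers often share failure modes.

```python
from math import prod


def series_availability(availabilities):
    """Every provider must be up, so the individual availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """One provider being up is enough: 1 minus the chance that all are down."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figure: three providers, each up 99.9% of the time, independently.
providers = [0.999, 0.999, 0.999]

print(f"series:   {series_availability(providers):.6f}")
print(f"parallel: {parallel_availability(providers):.9f}")
```

Under these assumptions, chaining three providers in series drops the stack to roughly 99.7% availability, which is worse than any one provider alone. Only the parallel arrangement improves on a single provider.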

morshu9001 · 2025-10-20T18:53:34 1760986414

Yes, but most of those companies aren't morons; they're just taking an acceptable risk. A multi-region or multi-cloud setup is nontrivial.

0xbadcafebee · 2025-10-21T00:07:24 1761005244

Most companies I've worked for (and have heard about from others) have lacked either the knowledge or the will to evaluate risk. They build things until they "just work", and their thought process ends there. They don't examine the design to identify its reliability and security risks. They don't calculate the losses. They still have issues, but they just happen to be acceptable most of the time.

Example 1: A company's infra goes down, but it doesn't come back up correctly. People run around trying to get it working again. It takes much longer than they hoped/expected, and they lose a lot more money than they expected. This is because they never really understood the risk they were exposed to. If they understood it, they would have done more ahead of time to mitigate that much risk.

(Today's outage is this case. A lot of companies are going to lose money after today, because their customers are not happy with these "acceptable risks". Presumably, losing this much money due to one outage will not be an acceptable risk in hindsight. So the company either didn't understand its risk, or it did but was too stupid to prevent it.)

Example 2: A company gets hacked, and its data is either exposed or wiped. This is a much worse result; they can lose tons of money, chase off customers, damage their brand, open themselves up to lawsuits and fines, even tank the whole company. It's clear that this risk is pretty unacceptable. But it keeps happening. And the reason usually isn't "some genius hacker"; it's a failure to understand the risk of not investing in security.

(There are tons of examples of these in the news. Presumably, not investing in security was not an acceptable risk in hindsight when it ended their business! Almost always, the people involved in making these products don't know enough about security to understand the risks. But they also don't invest in security training, mandatory security controls, checklists, processes, quality gates, etc.)

You don't need multi-region or multi-cloud to mitigate reliability risks. Just like you don't need to hire a big security team or invest tons of cash to mitigate security risks. You can use your existing infra and tools, and mitigate both issues. You just have to use them wisely. It takes some effort and time, but you do it once and it pays dividends indefinitely.

Building something without identifying its security/reliability risks, and then not calculating those risks' impact, is not acceptable risk; it's ignored risk. Is tanking your company and shedding customers an acceptable risk? Well, there's one way to find out.
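"Calculating those risks' impact" can start as a back-of-the-envelope expected-loss estimate. The sketch below uses the standard "rate times cost per incident" approach. The class, the function, and every number are hypothetical placeholders, not figures from this thread.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    incidents_per_year: float  # assumed annual rate of occurrence
    loss_per_incident: float   # assumed cost of one incident, in dollars

    def expected_annual_loss(self) -> float:
        return self.incidents_per_year * self.loss_per_incident


def worth_mitigating(risk: Risk, mitigation_cost_per_year: float,
                     reduction: float) -> bool:
    """True if the expected loss avoided exceeds the yearly mitigation cost.

    reduction is the assumed fraction (0..1) of expected loss the
    mitigation removes.
    """
    return risk.expected_annual_loss() * reduction > mitigation_cost_per_year


# Hypothetical: one bad outage every two years, costing $200k each time.
outage = Risk("botched recovery after outage", 0.5, 200_000)
print(outage.expected_annual_loss())
print(worth_mitigating(outage, mitigation_cost_per_year=20_000, reduction=0.8))
```

The inputs are guesses, but writing them down is what turns an ignored risk into one that was actually evaluated and then accepted or mitigated.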

morshu9001 · 2025-10-21T15:16:17 1761059777

Being immune to this outage would have required cross-region failover. We'll see if customers switch to whatever company was resilient, but this has happened before and the answer was no.

0xbadcafebee · 2025-10-21T22:37:03 1761086223

The company I work for has a ton of stuff in us-east-1, many large products and sites, and we didn't go down. Our products/services aren't multi-region or multi-cloud. We don't pay exorbitant bills or have super complicated architectures.

morshu9001 · 2025-10-22T00:46:50 1761094010

If you were using AWS services that went down in us-east-1, how did you avoid an outage without failing over to anything outside that region?

0xbadcafebee · 2025-10-22T16:05:27 1761149127

That's the thing - most AWS services didn't "go down", as in stop working entirely. There were specific operations of specific services that were failing. Increased API error rates, inability to start new EC2 instances, billing metrics unavailable, AWS console unavailable, etc.

The outage wasn't like "all our servers stopped running". It was dynamic, new, specific operations that failed. If you just had a Fargate container that was started a week ago, and you have no need to restart the container today, it just kept chugging along.

Our architecture is stuff that just keeps chugging along. Fargate, S3, RDS, CloudFront, Cloudflare, etc. From our perspective, there was no outage in us-east-1. Literally the only alert we got the entire time was "billing limit exceeded" - and that was a false alarm, because it was set to alarm if there is zero billing data.
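The false billing alarm is an instance of a general pattern: an alarm configured to treat missing data as a breach fires when the metric source goes quiet, even though nothing crossed the threshold. Here is a minimal sketch of that evaluation logic. The function and its parameters are hypothetical and not an AWS API. CloudWatch alarms do expose a comparable `TreatMissingData` setting, but the thread does not say how this particular alarm was configured.

```python
from typing import Optional, Sequence


def alarm_state(datapoints: Sequence[Optional[float]], threshold: float,
                treat_missing_as_breach: bool) -> str:
    """Evaluate one window; None means no datapoint was reported."""
    present = [d for d in datapoints if d is not None]
    if any(d > threshold for d in present):
        return "ALARM"  # a real breach
    if not present:
        # No data at all: the missing-data policy decides the outcome.
        return "ALARM" if treat_missing_as_breach else "INSUFFICIENT_DATA"
    return "OK"


# Billing metrics unavailable during the incident: every datapoint is missing.
window = [None, None, None]
print(alarm_state(window, threshold=1000.0, treat_missing_as_breach=True))
print(alarm_state(window, threshold=1000.0, treat_missing_as_breach=False))
```

Treating missing data as a breach is a defensible choice, since silence from a metric can itself signal trouble. The cost is false positives like this one when only the metrics pipeline is degraded.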

morshu9001 · 2025-10-24T00:59:24 1761267564

But is this strategy or luck? I'm not seeing how those many companies did something dumb or wrong here while you did it right. Like, are they only affected because they overcomplicated their deployments? Either way, it sounds like your service isn't resilient against a generalized regional outage.