> This issue is affecting the global console landing page, which is also hosted ...

all_usernames · on Dec 7, 2021

Every damn Well-Architected Framework includes multi-AZ if not multi-region redundancy, and yet the single access point for their millions of customers is single-region. Facepalm in the form of $100Ms in service credits.

cronix · on Dec 7, 2021

> Facepalm in the form of $100Ms in service credits.

It was also greatly affecting Amazon.com itself. I kept getting sporadic 404 pages, and one was during a purchase. Purchase history wasn't showing the product as purchased and I didn't receive an email, so I repurchased. Still no email; this time the purchase didn't end in a 404, but the product still didn't show up in my purchase history. I have no idea whether I purchased anything or not. I have never had an issue purchasing before. I normally get a confirmation email within 2 or so minutes and the sale is immediately reflected in purchase history. I was unaware of the greater problem at that moment or I would have steered clear at the first 404.

jjoonathan · on Dec 7, 2021

Oh no... I think you may be in for a rough time, because I purchased something this morning and it only popped up in my orders list a few minutes ago.

toomanyrichies · on Dec 8, 2021

They're also unable to refund Kindle book orders via their website. The "Request a refund" page has a 500 error, so they fall back to letting you request a call from a customer service rep. Initiating this request also fails, so they then fall back to showing a 1-888 number that the customer can call. Of course, when I tried to call, I got "All circuits are busy".

vkgfx · on Dec 7, 2021

> Facepalm in the form of $100Ms in service credits.

Part of me wonders how much they're actually going to pay out, given that their own status page has only indicated five services with moderate ("Increased API Error Rates") disruptions in service.

sitharus · on Dec 8, 2021

That public status page has no bearing on service credits; it's a statically hosted page updated when there's significant public impact. A lot of issues never make it there.

Every AWS customer has a Personal Health Dashboard, which is updated much faster and links issues to your affected resources. Additionally, requests for credits are handled by the customer service team, who have even more information.

JPKab · on Dec 7, 2021

Utter lies on that page. Multiple services listed as green aren't working for me or my team.

GabrielBen · on Dec 7, 2021

This point is repeated often, and the incentives for Amazon to downplay the actual downtime are definitely there.

Wouldn't affected companies be incentivized to file a lawsuit over Amazon lying about status? It would be easy to prove and costly to defend from AWS's standpoint.

CrazyCatDog · on Dec 7, 2021

Suggesting that when the status page sends a status request and hears no response, it defaults to green: hear no evil, see no evil -> report no evil.

Either way, overt lies or engineering incompetence, it’s disappointing!
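
To make the speculated failure mode concrete, here is a minimal Python sketch. It is purely illustrative and not a claim about how the AWS status page actually works (replies below suggest it is updated by hand). The names `Status`, `naive_status`, and `fail_safe_status` are hypothetical, and `probe_ok` stands in for the result of a health probe, with `None` meaning the probe never answered.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    GREEN = "operating normally"
    RED = "disrupted"
    UNKNOWN = "no data"


def naive_status(probe_ok: Optional[bool]) -> Status:
    # Buggy pattern: only an explicit failure turns the light red,
    # so a probe that never answered (None) still shows green.
    return Status.RED if probe_ok is False else Status.GREEN


def fail_safe_status(probe_ok: Optional[bool]) -> Status:
    # Safer pattern: green requires positive evidence of health.
    if probe_ok is True:
        return Status.GREEN
    if probe_ok is False:
        return Status.RED
    # No response is reported as "no data", never as healthy.
    return Status.UNKNOWN
```

With a silent probe, `naive_status(None)` returns `Status.GREEN` while `fail_safe_status(None)` returns `Status.UNKNOWN`, which is the difference between "hear no evil, report no evil" and admitting there is no data.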

skj · on Dec 8, 2021

Pretty low chance that the status page is automated, especially via health checks. I imagine it's a static asset updated by hand.

jandrese · on Dec 8, 2021

Or the service that updates the status page runs out of us-east-1.

erhk · on Dec 8, 2021

It has customer relationship implications. I guarantee you it is updated by a support agent.

zainhoda · on Dec 8, 2021

https://stop.lying.cloud

brentcetinich · on Dec 8, 2021

Don’t think there is an SLA for the console, so you would not be claiming anything for the console at least.

stevehawk · on Dec 7, 2021

I don't know if that should surprise us. AWS hosted their status page in S3 so it couldn't even reflect its own outage properly ~5 years ago. https://www.theregister.com/2017/03/01/aws_s3_outage/

tekromancr · on Dec 7, 2021

I just want to serve 5 terabytes of data

mrep · on Dec 7, 2021

Reference for those out of the loop: https://news.ycombinator.com/item?id=29082014

ithkuil · on Dec 7, 2021

One region? I forgot how to count that low

edoceo · on Dec 7, 2021

It's like three regions - when two of them explode.

Two is one & one is none.

ithkuil · on Dec 9, 2021

the obvious solution is to put all internet in one region so that when that one explodes nobody notices your little service

sangnoir · on Dec 8, 2021

> At a different (unnamed) FAANG

I'm guessing Google, on the basis of the recently published (to the public) "I just want to serve 5TB"[1] video. If it isn't Google, then the broccoli man video is still a cogent reminder that unyielding multi-region rigor comes with costs.

1. https://www.youtube.com/watch?v=3t6L-FlfeaI

jesboat · on Dec 9, 2021

It's salient that the video is from 2010. Where I was (not Google), the push to make everything multi-region only really started in, maybe, 2011 or 2012. And, for a long time, making services multi-region actually was a huge pain. (Exception: there was a way to have lambda-like code with access to a global eventually-consistent DB.)

The point is that we made it easier. By the time I left, things were basically just multi-region by default. (To be sure, there were still sharp edges. Services which needed to store data (like, databases) were a nightmare to manage. Services which needed to be in the same region as specific instances of other services, e.g. something which wanted to be running in the same region as wherever the master shard of its database was running, were another nasty case.)

The point was that every service was expected to be multi-region, which was enforced by regular fire drills, and if you didn't have a pretty darn good story about why regular announced downtime was fine, people would be asking serious questions.

And anything external going down for more than a minute or two (e.g. for a failover) would be inexcusable. Especially for something like a bloody login page.

alfiedotwtf · on Dec 7, 2021

Maybe has something to do with CloudFront mandating certs to be in us-east-1?

tekromancr · on Dec 7, 2021

YES! Why do they do that? It's so weird. I will deploy a whole config into us-west-1 or something; but then I need to create a new cert in us-east-1 JUST to let CloudFront answer an HTTPS call. So frustrating.

jamesfinlayson · on Dec 7, 2021

Agreed - in my line of work regulators want everything in the country we operate from but of course CloudFront has to be different.

tekromancr · on Dec 8, 2021

Wouldn't using a global CDN for everything be off the table to begin with, in that case?

jamesfinlayson · on Dec 9, 2021

Apparently it's okay for static data (like a website hosted in S3 behind CloudFront) but seeing non-Australian items in AWS billing and overviews always makes us look twice.

ehsankia · on Dec 7, 2021

Forget the number of regions. Monitoring for X shouldn't even be hosted on X at all...

mise_en_place · on Dec 7, 2021

Exactly. And I’m surprised AWS doesn’t have failover. That’s basic SOP for an SRE team.

hericium · on Dec 8, 2021

> Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?

They're cheap. HA is for their customers to pay more, not for Amazon which often lies during major outages. They would lose money on HA and they would lose money on acknowledging downtimes. They will lie as long as they benefit from it.

sheenobu · on Dec 7, 2021

I think I know specifically what you are talking about. The actual files an engineer could upload to populate their folder were not multi-region for a long time. The servers were, because they were stateless and that was easy to multi-region, but the actual data wasn't until we replaced the storage service.

jesboat · on Dec 9, 2021

I think the storage was replicated by 2013? Definitely by 2014. It didn't have automated failover, but failover could be done, and was done during the relevant drills for some time.

I think it only stopped when the storage service got to the "deprecated, and we're not bothering to do a failover because dependent teams who care should just use something else, because this one is being shut down any year now" stage. (I don't agree with that decision, obviously ;) but I do have sympathy for the team stuck running a condemned service. Sigh.)

After stuff was migrated to the new storage service (probably somewhere in the 2017-2019 range but I have no idea when), I have no idea how DR/failover worked.

sheenobu · on Dec 11, 2021

Thank you for the sympathy. If we are talking about the same product then it was most likely backed by 3 different storage services over its lifespan: 2013/2014 was a third-party product that had some replication/failover baked in, 2016-2019 was on my team with no failover plans due to "deprecated, don't bother putting anything important here", then 2019 onward was "fully replicated and automatic failover capable and also less cost-per-GB to replicate but less flexible for the existing use cases".

balls187 · on Dec 8, 2021

MAANG*

How long before Meta takes over for Facebook?

mastax · on Dec 8, 2021

Well, Alphabet needs to take over for Google first.

boringg · on Dec 8, 2021

So it's MAAAAN? Seems disappointing

rdiddly · on Dec 9, 2021

That's just, like, your opinion maaaan

majewsky · on Dec 9, 2021

I like MAGMA (Meta, Amazon, Google, Microsoft, Apple).

Especially when you are getting burned by an outage.

ents · on Dec 8, 2021

MANGA