Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> This issue is affecting the global console landing page, which is also hosted in US-EAST-1

Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?

At a different (unnamed) FAANG, we considered it unacceptable to have anything depend on a single region. Even the dinky little volunteer-run thing which ran https://internal.site.example/~someEngineer was expected to be multi-region, and was, because there was enough infrastructure for making things multi-region that it was usually pretty easy.



Every damn Well-Architected Framework includes multi-AZ if not multi-region redundancy, and yet the single access point for their millions of customers is single-region. Facepalm in the form of $100Ms in service credits.


> Facepalm in the form of $100Ms in service credits.

It was also greatly affecting Amazon.com itself. I kept getting sporadic 404 pages and one was during a purchase. Purchase history wasn't showing the product as purchased and I didn't receive an email, so I repurchased. Still no email, but the purchase didn't end in a 404, but the product still didn't show up in my purchase history. I have no idea if I purchased anything, or not. I have never had an issue purchasing. Normally get a confirmation email within 2 or so minutes and the sale is immediately reflected in purchase history. I was unaware of the greater problem at that moment or I would have steered clear at the first 404.


Oh no... I think you may be in for a rough time, because I purchased something this morning and it only popped up in my orders list a few minutes ago.


They're also unable to refund Kindle book orders via their website. The "Request a refund" page has a 500 error, so they fall back to letting you request a call from a customer service rep. Initiating this request also fails, so they then fall back to showing a 1-888 number that the customer can call. Of course, when I tried to call, I got "All circuits are busy".


>Facepalm in the form of $100Ms in service credits.

Part of me wonders how much they're actually going to pay out, given that their own status page has only indicated five services with moderate ("Increased API Error Rates") disruptions in service.


That public status page has no bearing on service credits, it's a statically hosted page updated when there's significant public impact. A lot of issues never make it there.

Every AWS customer has a personal health dashboard that links the issues to their services which is updated much faster, and links issues to your affected resources. Additionally requests for credits are done by the customer service team who have even more information.


Utter lies on that page. Multiple services listed as green aren't working for me or my team.


This point is repeated often, and the incentives for Amazon to downplay the actual downtime are definitely there.

Wouldn't affected companies be incentivized to make a lawsuit about AMZ lying about status? It would be easy to prove and costly to defend from AWS standpoint.


Suggesting that when the status page sends a status request and hears no response—it defaults to green—hear no evil and see no evil —> report no evil

Either way—overt lies or engineering incompetence—it’s disappointing!


Pretty low chance that the status page is automated, especially via health checks. I imagine it's a static asset updated by hand.


Or the service that updates the status page runs out of us-east-1.


It has customer relationship implications. I guarantee you it is updated by a support agent.



Don’t think there is an sla for the console , so you would not be claiming anything for the console at least


I don't know if that should surprise us. AWS hosted their status page in S3 so it couldn't even reflect its own outage properly ~5 years ago. https://www.theregister.com/2017/03/01/aws_s3_outage/


I just want to serve 5 terabytes of data


Reference for those out of the loop: https://news.ycombinator.com/item?id=29082014


One region? I forgot how to count that low


It's like three regions - when two of them explode.

Two is one & one is none.


the obvious solution is to put all internet in one region so that when that one explodes nobody notices your little service


> At a different (unnamed) FAANG

I'm guessing Google, on the basis of the recently published (to the public) "I just want to serve 5TB"[1] video. If it isn't Google, then the broccoli man video is still a cogent reminder that unyielding multi-region rigor comes with costs.

1. https://www.youtube.com/watch?v=3t6L-FlfeaI


It's salient that the video is from 2010. Where I was (not Google), the push to make everything multi-region only really started in, maybe, 2011 or 2012. And, for a long time, making services multi-region actually was a huge pain. (Exception: there was a way to have lambda-like code with access to a global eventually-consistent DB.)

The point is that we made it easier. By the time I left, things were basically just multi-region by default. (To be sure, there were still sharp edges. Services which needed to store data (like, databases) were a nightmare to manage. Services which needed to be in the same region specific instances of other services, e.g. something which wanted to be running in the same region as wherever the master shard of its database was running, were another nasty case.)

The point was that every services was expected to be multi-region, which was enforced by regular fire drills, and if you didn't have a pretty darn good story about why regular announced downtime was fine, people would be asking serious questions.

And anything external going down for more than a minute or two (e.g. for a failover) would be inexcusable. Especially for something like a bloody login page.


Maybe has something to do with CloudFront mandating certs to be in us-east-1?


YES! Why do they do that? It's so weird. I will deploy a whole config into us-west-1 or something; but then I need to create a new cert in us-east-1 JUST to let cloudfront answer an HTTPS call. So frustrating.


Agreed - in my line of work regulators want everything in the country we operate from but of course CloudFront has to be different.


Wouldn't using a global CDN for everything be off the table to begin with, in that case?


Apparently it's okay for static data (like a website hosted in S3 behind CloudFront) but seeing non-Australian items in AWS billing and overviews always makes us look twice.


Forget the number of regions. Monitoring for X shouldn't even be hosted on X at all...


Exactly. And I’m surprised AWS doesn’t have failover. That’s basic SOP for an SRE team.


> Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?

They're cheap. HA is for their customers to pay more, not for Amazon which often lies during major outages. They would lose money on HA and they would lose money on acknowledging downtimes. They will lie as long as they benefit from it.


I think I know specifically what you are talking about. The actual files an engineer could upload to populate their folder was not multi-region for a long time. The servers were, because they were stateless and that was easy to multi-region, but the actual data wasn't until we replaced the storage service.


I think the storage was replicated by 2013? Definitely by 2014. It didn't have automated failover, but failover could be done, and was done during the relevant drills for some time.

I think it only stopped when the storage services got to the "deprecated, and we're not bothering to do a failover because dependent teams who care should just use something else, because this one is being shut down any year now". (I don't agree with that decision, obviously ;) but I do have sympathy for the team stuck running a condemned service. Sigh.)

After stuff was migrated to the new storage service (probably somewhere in the 2017-2019 range but I have no idea when), I have no idea how DR/failover worked.


Thank you for the sympathy. If we are talking about the same product then it was most likely backed by 3 different storage services over its lifespan, 2013/2014 was a third party product that had some replication/fail-over baked in, 2016-2019 on my team with no failover plans due to "deprecated, dont bother putting anything important here", then 2019 onward with "fully replicated and automatic failover capable and also less cost-per-GB to replicate but less flexible for the existing use cases".


MAANG*

How long before Meta takes over for Facebook?


Well, alphabet needs to take over for Google first.


So its MAAAAN? Seems disappointing


That's just, like, your opinion maaaan


I like MAGMA (Meta, Amazon, Google, Microsoft, Apple).

Especially when you are getting burned by an outage.


MANGA




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: