
Thanks for the kind words. I'm sorry for letting our users down. We'll ask the 5 Whys: https://en.wikipedia.org/wiki/5_Whys We need to go from the initial mistake (wrong machine, solved by better hostname display and colors), to the second (not having a recent backup), to the third (not testing backups), to the fourth (not having a script for backup restores), to the fifth (nobody in charge of data durability and no written plan). The solutions above are just guesses at this point; we'll dive into this in the coming days and will communicate what we will do in a blog post.
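For the third and fourth whys (untested backups, no restore script), here is a minimal sketch of what an automated restore test could look like. Everything in it is an assumption for illustration: pg_dump custom-format archives in a hypothetical /var/backups/postgres directory and a scratch database that can be dropped and recreated freely. It is not GitLab's actual setup.

    #!/usr/bin/env python3
    """Minimal sketch of an automated backup restore test (hypothetical setup)."""
    import pathlib
    import subprocess
    import sys

    BACKUP_DIR = pathlib.Path("/var/backups/postgres")  # hypothetical location
    SCRATCH_DB = "restore_test"                          # hypothetical scratch database

    def latest_backup() -> pathlib.Path:
        """Pick the newest .dump archive; fail loudly if none exist."""
        dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
        if not dumps:
            sys.exit("FAIL: no backup archives found")
        return dumps[-1]

    def main() -> None:
        dump = latest_backup()
        # Recreate the scratch database and restore the newest archive into it.
        subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
        subprocess.run(["createdb", SCRATCH_DB], check=True)
        subprocess.run(["pg_restore", "--dbname", SCRATCH_DB, str(dump)], check=True)
        # Basic sanity check: the restored database actually contains user tables.
        out = subprocess.run(
            ["psql", "-At", "-d", SCRATCH_DB,
             "-c", "SELECT count(*) FROM pg_stat_user_tables;"],
            check=True, capture_output=True, text=True,
        )
        if int(out.stdout.strip()) == 0:
            sys.exit(f"FAIL: {dump.name} restored but contains no tables")
        print(f"OK: restored {dump.name} into {SCRATCH_DB}")

    if __name__ == "__main__":
        main()

Run on a schedule, something like this turns "we have backups" into a claim that is checked every day instead of assumed.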


Good morning (posted from a throwaway for reasons I'll describe).

I feel for you greatly here, and I commend your openness about how the data restoration caused 6 hours of data loss. I too work in a critical area where even a few minutes of lost DB data is bad.

We just had our own test event recently, where we make sure that we can fail everything over and run on all secondaries. I found out how that worked: we failed. The problem is that I only found out after the fact. Due to the secrecy, not even the teams involved knew why things failed the way they did. I had to piece it together from disjointed hearsay, and now I believe I have a coherent picture.

So yes, when I read your post-mortem and RCA, it reminded me greatly of what happened here as well. But we can all learn from your example. As for me, I'm posting from a throwaway due to the likely threat to my job.


I agree that 6 hours is way too much.


Just know that your transparency in this situation has put you leaps and bounds ahead of other vendors in my mind.

Thank you for sharing.


I admire the response and quick action, but after reading this, I get the strong feeling GitLab et al didn't realize they were running a real business with real projects and real people who trusted them to be a good custodian of their data.


I would also ask: why are we running our own Postgres setup and not using RDS? And then: why do we not have production Postgres DBAs on staff who would do this rather than an engineer?


As was explained on the live stream, they're not using RDS because they want more control over the database and the ability to move providers.


Make sure everyone is doing OK, and don't let them beat themselves up over it.

All the love and support in the world to the team.


Thanks, mozair. Everyone is OK but sad. I sent this tweet earlier: https://twitter.com/sytses/status/826598260831842308

Thanks for the support, we've received a lot of kind reactions and are very grateful for them.


If you do the 5 Whys and are honest, you'll discover the real root cause is not having a programmatic backup and restore procedure.

All outages are blameless. It's always a process failure or lack of a proper system or tool.
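To make the "programmatic" part concrete, here is a minimal sketch of a scripted backup step that refuses to succeed silently. The database name, backup directory, and pg_dump options are assumptions for illustration, not GitLab's actual procedure; it pairs with a restore test like the one sketched above.

    #!/usr/bin/env python3
    """Minimal sketch of a scripted, verifiable backup step (hypothetical setup)."""
    import datetime
    import pathlib
    import subprocess
    import sys

    DB_NAME = "production_db"                       # hypothetical database name
    BACKUP_DIR = pathlib.Path("/var/backups/postgres")  # hypothetical location

    def main() -> None:
        BACKUP_DIR.mkdir(parents=True, exist_ok=True)
        stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        target = BACKUP_DIR / f"{DB_NAME}-{stamp}.dump"
        # Custom-format dump so pg_restore can later do full or selective restores.
        subprocess.run(
            ["pg_dump", "--format=custom", "--file", str(target), DB_NAME],
            check=True,
        )
        # A missing or zero-byte file means the dump failed without saying so.
        if not target.exists() or target.stat().st_size == 0:
            sys.exit(f"FAIL: backup {target} is missing or empty")
        print(f"OK: wrote {target} ({target.stat().st_size} bytes)")

    if __name__ == "__main__":
        main()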




