Thanks for the kind words. I'm sorry for letting our users down. We'll ask the 5 Whys (https://en.wikipedia.org/wiki/5_Whys). We need to go from the initial mistake (wrong machine, solved by better hostname display and colors), to the second (not having a recent backup), to the third (not testing backups), to the fourth (not having a script for backup restores), to the fifth (nobody in charge of data durability and no written plan). The solutions above are just guesses at this point; we'll dive into this in the coming days and will communicate what we will do in a blog post.
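For the "not testing backups" / "no restore script" whys, the usual fix is a scheduled job that actually restores the latest dump into a scratch database and checks a sanity query, so a stale or empty backup fails loudly instead of being discovered during an incident. Below is a minimal sketch of that idea, assuming pg_dump custom-format archives in a hypothetical /var/backups/postgres directory, a hypothetical scratch DSN, and an example table name; none of these come from GitLab's actual setup.

```python
#!/usr/bin/env python3
"""Minimal sketch of an automated backup-restore check.

Assumptions (all hypothetical, not GitLab's real configuration):
- pg_dump custom-format archives land in BACKUP_DIR as *.dump files
- a throwaway scratch database is reachable via SCRATCH_DSN
- 'projects' is just an example table for the sanity query
"""

import subprocess
import sys
from pathlib import Path

BACKUP_DIR = Path("/var/backups/postgres")  # hypothetical backup location
SCRATCH_DSN = "postgresql://restore_check@scratch-db/restore_test"  # hypothetical scratch DB


def latest_backup() -> Path:
    """Pick the newest .dump archive; fail loudly if none exist."""
    dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        sys.exit("FAIL: no backup archives found")
    return dumps[-1]


def restore_and_verify(dump: Path) -> None:
    """Restore the archive into the scratch DB and run a basic row-count check."""
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--no-owner",
         "--dbname", SCRATCH_DSN, str(dump)],
        check=True,
    )
    result = subprocess.run(
        ["psql", SCRATCH_DSN, "-At", "-c", "SELECT count(*) FROM projects;"],
        check=True, capture_output=True, text=True,
    )
    count = int(result.stdout.strip())
    if count == 0:
        sys.exit(f"FAIL: restored {dump.name} but the projects table is empty")
    print(f"OK: restored {dump.name}, projects={count}")


if __name__ == "__main__":
    restore_and_verify(latest_backup())
```

Run nightly from cron or CI and page someone when it exits non-zero; the point is that a backup nobody has restored is not really a backup.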
Good morning (posted from a throwaway for reasons I'll describe).
I feel for you greatly here, and I commend your openness about how the data restoration caused 6 hours of data loss. I too work in a critical area where even minutes of lost DB data is bad.
We just had our own test event recently, where we make sure we can fail everything over and run on all secondaries. I found out how that went: we failed. The problem is that I only found out after the fact. Due to the secrecy, not even the teams involved knew why things failed the way they did. I had to piece it together from disjointed hearsay, and I now believe I have a coherent picture.
So yes, when I read your postmortem and RCA, it reminded me greatly of what happened here as well. But we can all learn from your example. As for me, I'm posting from a throwaway because of the likely threat to my job.
I admire the response and quick action, but after reading this, I get the strong feeling GitLab et al. didn't realize they were running a real business with real projects and real people who trusted you to be a good custodian of their data.
I would also ask: why are we running our own Postgres setup and not using RDS? And why do we not have production Postgres DBAs on staff who would do this, rather than an engineer?