Thanks for the kind words. I'm sorry for letting our users down. We'll ask the 5 Whys (https://en.wikipedia.org/wiki/5_Whys). We need to go from the initial mistake (wrong machine, solved by better hostname display and colors), to the second (not having a recent backup), to the third (not testing backups), to the fourth (not having a script for backup restores), to the fifth (nobody in charge of data durability and no written plan). The solutions above are just guesses at this point; we'll dive into this in the coming days and will communicate what we will do in a blog post.
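For the "not testing backups" / "no restore script" whys, the usual fix is a scheduled job that actually restores the latest dump into a scratch database and checks a sanity query, so a stale or empty backup fails loudly instead of being discovered during an incident. Below is a minimal sketch of that idea, assuming pg_dump custom-format archives in a hypothetical /var/backups/postgres directory, a hypothetical scratch DSN, and an example table name; none of these come from GitLab's actual setup.

```python
#!/usr/bin/env python3
"""Minimal sketch of an automated backup-restore check.

Assumptions (all hypothetical, not GitLab's real configuration):
- pg_dump custom-format archives land in BACKUP_DIR as *.dump files
- a throwaway scratch database is reachable via SCRATCH_DSN
- 'projects' is just an example table for the sanity query
"""

import subprocess
import sys
from pathlib import Path

BACKUP_DIR = Path("/var/backups/postgres")  # hypothetical backup location
SCRATCH_DSN = "postgresql://restore_check@scratch-db/restore_test"  # hypothetical scratch DB


def latest_backup() -> Path:
    """Pick the newest .dump archive; fail loudly if none exist."""
    dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        sys.exit("FAIL: no backup archives found")
    return dumps[-1]


def restore_and_verify(dump: Path) -> None:
    """Restore the archive into the scratch DB and run a basic row-count check."""
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--no-owner",
         "--dbname", SCRATCH_DSN, str(dump)],
        check=True,
    )
    result = subprocess.run(
        ["psql", SCRATCH_DSN, "-At", "-c", "SELECT count(*) FROM projects;"],
        check=True, capture_output=True, text=True,
    )
    count = int(result.stdout.strip())
    if count == 0:
        sys.exit(f"FAIL: restored {dump.name} but the projects table is empty")
    print(f"OK: restored {dump.name}, projects={count}")


if __name__ == "__main__":
    restore_and_verify(latest_backup())
```

Run nightly from cron or CI and page someone when it exits non-zero; the point is that a backup nobody has restored is not really a backup.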
Good morning (posted from a throwaway for reasons I'll describe).
I feel for you greatly here, and I commend your openness about how the data restoration caused 6 hours of data loss. I too work in a critical area where even minutes of lost DB data is bad.
We just had our own test event recently, where we make sure we can fail everything over and run on all secondaries. I found out how that went: we failed. The problem is that I only found out after the fact. Due to the secrecy, not even the teams involved knew why things failed the way they did. I had to piece it together from disjointed hearsay, and I now believe I have a coherent picture.
So yes, when I read your postmortem and RCA, it reminded me greatly of what happened here as well. But we can all learn from your example. As for me, I'm posting from a throwaway because of the likely threat to my job.
I admire the response and quick action, but after reading this, I get the strong feeling GitLab et al. didn't realize they were running a real business with real projects and real people who trusted you to be a good custodian of their data.
I would also ask: why are we running our own Postgres setup and not using RDS? And why do we not have production Postgres DBAs on staff who would do this, rather than an engineer?