Our PostgreSQL failover plan is very pedestrian, but it works well (even across ... | Hacker News


scurvy on Dec 7, 2017 | on: PostgreSQL HA cluster failure: a post-mortem

Our PostgreSQL failover plan is very pedestrian, but it works well (even across datacenters). We run streaming replication off of the primary to a pair of replicas, with one replica in another datacenter. The primary write DB advertises a loopback IP via OSPF to a top-of-rack switch, where it's aggregated by BGP and distributed throughout our network. There's a health check script [0] running every 3 seconds that makes sure PG is happy and that it is still writeable.

If we want to fail over (nothing automatic), we stop the primary (or it's already dead) and the route is withdrawn. An operator touches the recovery file on the new primary, the health checker sees that, and the IP is announced back into the network. Yes, it's a "VIP", but it's one controlled by our operations team, not automation software.

One nice thing about this is that you can fail over across datacenters (remember it's advertised into our network over BGP) without reconfiguring DNS or messing with application servers.

While the mechanisms are different, we do something very similar for MySQL with MHA. It's still an operator running scripts intentionally though (which is what we want). I will definitely agree with you that manual operator intervention is better than automated failover.

[0] https://github.com/unixsurfer/anycast_healthchecker
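For anyone trying to picture the health check side, here is a rough sketch of the decision it has to make. This is not the script described above, only a guess at its shape. The names `Probe`, `exit_status`, and `check` are made up for illustration, and it assumes the checker treats exit status 0 as "healthy, keep announcing the IP" and anything else as "withdraw it". The probe itself is left as a callable so the logic can be read without a database; in practice it might run something like `SELECT pg_is_in_recovery()` plus a small test write.

```python
from typing import Callable, NamedTuple


class Probe(NamedTuple):
    """Result of one health probe (hypothetical structure)."""
    reachable: bool    # could we connect and run a trivial query?
    in_recovery: bool  # e.g. the result of SELECT pg_is_in_recovery()
    write_ok: bool     # did a small test write succeed?


def exit_status(probe: Probe) -> int:
    """Return 0 only for a reachable, writable primary; otherwise 1."""
    if not probe.reachable:
        return 1
    if probe.in_recovery:
        # Still a replica: it must not attract write traffic.
        return 1
    return 0 if probe.write_ok else 1


def check(run_probe: Callable[[], Probe]) -> int:
    """Run the probe and fail closed: any error counts as unhealthy."""
    try:
        return exit_status(run_probe())
    except Exception:
        return 1
```

Failing closed matters here: if the probe itself errors out, the sketch reports unhealthy, so the route is withdrawn and no node advertises the write IP unless it has positively shown it can take writes.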

bringtheaction on Dec 7, 2017

I'd love to read more about this.

If you have the time and motivation to write a detailed post, it would be very welcome to me and, I am sure, to many others as well.

scurvy on Dec 7, 2017

Sure! I can probably write something up over the holidays.
