PostgreSQL: What happened to Hot Standby?

gfodor · on March 8, 2009

As frustrating as it is, PostgreSQL users have always benefited from the lack of defects in the software due to hard choices made like this.

aditya · on March 8, 2009

Hmm. PgSQL is rock solid as an RDBMS, but the lack of dependable replication options is quite annoying.

I'm curious to know if anyone here has actually used {londiste, pgpool2, slony} to do master/slave replication and made it work well?

mdasen · on March 8, 2009

It is a great database, but replication is lacking. Slony-I is the best of the bunch currently, but it works by using triggers, requires you to set it up for every table and key, scales quadratically, and just generally has a lot of overhead (when compared to MySQL's replication). Mammoth Replicator is the most interesting project to me since it operates a lot more like MySQL's replication in that it works off log shipping. Currently they only have a beta for download off their site (https://projects.commandprompt.com/public/replicator), but I'm hopeful for it.

Replication is a really hard problem to solve given all the little inconsistencies that can creep in, but it's one of those areas that's holding PostgreSQL back most. Hot failover and scalable read-only replicants are a big selling point for MySQL. I really hope that PostgreSQL nails this area in 2010 because it's query planner is much better at handling complex queries (really, even something as simple as a sub-query).

But you're right, it it annoying that PostgreSQL doesn't have a nicer replication option.

EDIT: If performance isn't your biggest concern and you don't mind a highly tedious setup, Slony-I is reliable. However, it's not suitable for larger clusters of machines since communication costs grow quadratically.

neilc · on March 8, 2009

it's not suitable for larger clusters of machines since communication costs grow quadratically.

Quadratically? Why?

mdasen · on March 8, 2009

http://www.pgadmin.org/docs/1.4/slony/slonylistenercosts.htm...

olefoo · on March 8, 2009

I have set up and run both slony and mammoth-replicator in production. It is not a trivial task.

The big problem I've encountered is that the database to be replicated must have defined primary keys. Far too few developers understand what is meant by referential integrity.

Andys · on March 9, 2009

I've used pgpool2 in (lightly loaded) production.

It is quite amazing and works as advertised. True multi-master for updates, and load balancing for reads. I don't know why it doesn't receive more attention.

It is, however, very fiddly to set up and maintain, and you have to provide and roll your own "initial synchronization" which you can base on WAL archiving or pg_dump, or whatever.

It has also been a work in progress, with its design and features changing over time as it matures. This has made it a bit hard to wrap in easy to use scripts.

mdasen · on March 9, 2009

The issue is that pgpool is that it doesn't keep data consistent necessarily. For example, MySQL's default replication is statement based. That means that if you execute "UPDATE table SET var=57 WHERE id=10" on the master, it will run that same statement on the mirror. BUT, MySQL's replication is smart. If you say, "UPDATE table set updated_at=now()" on the master, it won't run that on the mirror. Rather, it will run "UPDATE table set updated_at=<the time that now evaluated to on the master>". pgpool also requires table locking when dealing with SERIALs and that is just unacceptable for performance.

It just has limitations that don't make it a good replication tool for production environments. pgpool is an awesome tool, but not for replication. Use it for load balancing and connection pooling which it is wonderful for.

Andys · on March 9, 2009

I believe pgpool also deals with timestamps properly.

Secondly, no one is forcing you to use SERIAL datatypes - the table locking is there for backwards compatibility. SERIAL is not the best performance choice for replicated servers, and there's nothing to stop you using another style of ID field.

patrickg-zill · on March 8, 2009

For all the noise about replication, the reality is that buying a server with a) lots of RAM b) dual power supplies, each going to a separate UPS or power feed c) dual or quad CPUs will get you 95% of the way there.

If you really care, get an external storage solution and buy two identical servers; then you can switch the storage over to the working machine if the first one fails.

So many people I know have sites that do not require fancy-pants replication - if they just shelled out the $20K for a big 128GB RAM system and learned how to do decent backups they would save on hosting charges and have a site that is just as reliable.

tobi · on March 8, 2009

Maybe for some use cases but most people on this site are concerned about scaling web sites where scaling up (which is what you suggest) isn't feasible because of the obvious cap to the approach and the fact that you can't even add your own hardware to cloud/virtual hosting which is quickly becoming the norm.

In general, every problem that can be solved in software should be solved in software.

Postgres can be the second coming of jesus but it's still utterly infeasible for high traffic web applications because of the replication issue. Even the current hacks that add replication will not work because they cannot deal with schema changes without downtime. At Shopify we add an average of 12 columns and 2 tables to the database every month and we had a grand total of 58 minutes of downtime in 2008.

moe · on March 8, 2009

but it's still utterly infeasible for high traffic web applications because of the replication issue

Sorry, but what a nonsense.

High traffic sites scale by caching, sharding and non-relational databases, in that order.

Replication can, at best, be an intermediate kludge for read-heavy websites that haven't learned about caching (wink) or insist on abusing their RDBMS as a fulltext search engine.

It was also commonly used as an incremental backup solution until filesystem snapshots became commonplace.

tobi · on March 9, 2009

Replication has nothing to do with scalability in most cases. Replication is how you reduce downtime!

It's the same reason why Raid5 is more then useless in a web server - if a disk dies it has to reboot and has to reconstruct data for hours. You cannot have downtime in a web application.

This is of course slightly domain specific. We host several thousand e-commerce stores that are the livelihood of our clients. If we are down then no one can earn money.

Shopify being down is a lot worse then twitter being down. In fact we may even see legal action if our downtime is too bad.

The reason why you need replication is because you need 3+ 100% accurate and up to the second copies of your database which can take over at a moments notice. One single server, no matter how beefy it is, can never accomplish this. Someone is going to trip over it's power cords eventually ( and if it has two then someone is going to trip over the power cord of the network switch it's connected to ).

wmf · on March 9, 2009

Replication is how you reduce downtime!

It's one possible way.

It's the same reason why Raid5 is more then useless in a web server - if a disk dies it has to reboot and has to reconstruct data for hours.

Clearly you're using the wrong RAID; the whole point is to keep going after a disk failure.

you need 3+ 100% accurate and up to the second copies of your database which can take over at a moments notice. One single server, no matter how beefy it is, can never accomplish this.

An alternative would be a single reliable copy of the database (using RAID, redundant controllers, and multipathing) attached to multiple servers.

I suppose database replication is mighty attractive if you assume that disks have to be trapped inside servers.

aditya · on March 8, 2009

I think having multiple read slaves (and even write masters) is a perfectly acceptable way of distributing load, and in turn, scaling.

To completely ignore that sounds foolish.

EDIT: xzilla is right, I wasn't really recommending blindly adding write masters since that wouldn't work, but just that read slaves are not a bad idea. :)

moe · on March 9, 2009

Well, I have never seen a read-slave setup in a webapp for the purpose of scalability that would've made a lot of sense.

It's a sometimes a stopgap measure when money is cheaper than time, but will come back to haunt you later.

When your reads are starving in a webapp then you should look into fragment caching and content-generation but not much further. If that doesn't solve your problem then you're just very likely doing something fundamentally wrong and better go ask someone smarter.

xzilla · on March 8, 2009

Anyone who thinks you can scale writes by adding more masters is plain ignorant on the subject. The only way to scale writes is with better hardware or federation of application/data. The first of those is easy, the latter is hard.

Andys · on March 9, 2009

When I was using pgpool2, it dealt with schema changes seamlessly - you connect to the proxy and issue your table updates, and it all just works.

aditya · on March 8, 2009

Tobi, that's good to know... I'm assuming you're running MySQL then?

tobi · on March 8, 2009

We actually launched Shopify beta on Postgres. One of the last things we setup before launch (literally 2 days before) was replication.

This was foolish by us but there was a lot on our plates back in the day. We spent the entire weekend setting up every different replication solution for postgres and in the end I just had to make a judgement call and went to mysql.

Luckily we used rails which is pretty agnostic and we have a huge unit test coverage so we could quickly get it ported. We converted the data and launched on MySQL and have been very happy with it ever since.

In fact this was a good decision for other reasons as well: the MySQL query cache saved us from having to seriously look at caching for a crucial 3-4 months which we could spend on more important things.

On a personal note, i like Postgres much better if it weren't for those issues.

xzilla · on March 8, 2009

Wow! For less than $5000 and a days work, you could have hired any number of postgres consultants to setup PITR warm standby for you (or Slony if you really thought you needed the read slaves). Instead, you migrated your entire architecture over to another database system. Doesn't sound like a good idea to me, but I have to give you points for chutzpah!

tobi · on March 9, 2009

First of all, at the time 5k was completely unrealistic. I've worked for nearly 20 months without salary on Shopify before launching it and all my last savings went to pay (very low) salaries to my friends and colleagues who agreed to work on the project because they were passionate about it.

But even now, Shopify being a multi million dollar company, I still don't think this would have been well invested money. We had Slony setup and PITR is of no use here either. The point is that you cannot upgrade the schema and data in a migration without downtime in a Postgres setup. Mysql simply replicates alter table statements to it's clients and everything stays in sync.

Besides, changing architecture took - as i said - a few hours which ends up even at crazy hourly rates to be no more than 1k so even by direct comparison we saved money.

I know that you mean well with your suggestion but it's the same thing i've been hearing from a lot of Postgres supporters, they always argue that we did something wrong.

xzilla · on March 10, 2009

Yikes... didn't mean to make you so defensive! Let me retort and clarify (hopefully with understanding that this is not meant as criticism) :-)

First, I know people doing rails+slony; I think the most common way is by piping the sql to a file and then to slonik, but there are other methods too. Granted, it's not awesome. For a new rails/postgres shop, I'd look at pgpool (provided you really needed replication, which is dubious for a lot of people)

Second, this wasn't really about postgres/mysql. If you had told me you went from MySQL->Oracle 2 days before launch, I'd have concerns. Heck, even going from Postgres 8.2 -> 8.3, I'd want to make sure you had good test plans (which it sounds like you did).

As a general rule, swapping out your database infrastructure is not something I recommend people take lightly. Yes, rails shops have already decided that the application code is more important than the data, so it's a lot easier to do, but given subtle differences in SQL implementations, people still get bitten by it.

Please note, I never said you did anything wrong. You did something risky. Most people don't succeed with risky (which is why it doesn't sound like a good idea to me), but since you did I have to give you credit (again) for making a ballsy call and pulling it off. But I think even you realize that you probably would do things different if you had to do it over again.

russss · on March 8, 2009

You've got the wrong end of the stick here a bit. Postgres has had a "warm standby" system since version 8.1 which you can use to keep a second server in recovery mode waiting to go live. So as far as HA goes, that's pretty good - we use it at Last.fm and it works well for us.

What's lacking is a decent way of setting up read-only slaves like you can with MySQL. That was what was promised with 8.4 and that's what's annoying.

At Last.fm we have a large, frequently-changing DB schema which is impractical to replicate in its entirety using the existing Postgres replication mechanisms (Slony, Londiste, etc).

The annoying thing is that it appears they backed these changes out of 8.4 because of issues with the synchronous part of the replication patch (which we don't care about).

Oh well.

xzilla · on March 8, 2009

Just to clarify, hot standby wasn't backed out, because the changes were never "in". There was a working patch set against the 8.4 tree (which you could still get today) for hot standby, and the testing/review of that code ran into enough corner cases that the reviewers thought it was best to push it. While many of the same people are involved in the synchronous replication piece, the issues in that code didn't really play a part (it had pretty much been decided at least month ago that synchronous replication wouldn't make it)

aditya · on March 8, 2009

So, what do you guys do then?

I imagine Last.fm is not running on one gigantic postgres db, are you using db partitioning or something similar?

russss · on March 9, 2009

Most of our hardcore data is in non-relational sources (including Hadoop).

Our global DB is Postgres - that holds everything which we haven't had a reason to move off it - it's a pair of Sun boxes with 128GB of RAM using Postgres warm standby (in this case the hardware is a lot cheaper than trying to shard it).

And we also have a ton of other MySQL and Postgres databases running other services.

RobGR · on March 8, 2009

I think if you compare the cost of getting a dual power supply server, to the cost of getting two servers, you will conclude that buying two servers gives you more reliability per dollar.

I have come to the same conclusion even with regards to RAID. After using linux software RAID and occansionally the expensive 3Ware cards for years, I have decided that the cost and complecation of the RAID system inself also needs to be accounted for. Sometimes there is downtime directly attributable to the RAID scheme.

What I would like to be able to do, is set up two linux servers, such that any change to the file system on one would be writen to the other. This can be done with inotify and rsync, but I suspect that it would be hard to configure rsync to both have a low enough load on a heavily used database machine and also "keep up" enough that the slave system has a consistent snapshot.

It would be nice if there were some block-level way of doing this, so that I could have the slave system just be running the OS without the DB server started, and on the master every change that happened to the database data area would be quickly synced on a block level with the slave. This would remove the database software from the replicaton issue, replication would be handled on a filesystem or directory level, and the database would just have to be configured to sync to disk within a reasonable amount of time.

wmf · on March 8, 2009

I think if you compare the cost of getting a dual power supply server, to the cost of getting two servers, you will conclude that buying two servers gives you more reliability per dollar.

Really you should compare the TCO, including the sysadmin time to configure replication vs. the time to configure failover.

What I would like to be able to do, is set up two linux servers, such that any change to the file system on one would be writen to the other. This can be done with inotify and rsync, but I suspect that it would be hard to configure rsync to both have a low enough load on a heavily used database machine and also "keep up" enough that the slave system has a consistent snapshot.

This is called a dual-ported array plus GFS2 plus Heartbeat.

RobGR · on March 9, 2009

I have just spent some time reading up on GFS2. It sounds interesting. But it also seems aimed at "SAN" style storage, which is expensive. The acronyms that gaius posted above were also interesting to google, but I don't want to buy storage from a place like EMC or NetApp, I have some drives from Fry's that should be more than sufficient.

Here is what I want to do:

I will take two consumer grade computers less than a couple years old, and install an extra gigabit ethernet card in each one. I will connect the extra crads with a cross over cable. On the "master" machine, I will run Postgres or MySQL, with the data directory on a separate partition if the replication scheme is at the file system level. On the "slave" machine, I willnot have the database server started up. Slow option with a lot of overhead: when the data dir changes, inotify kicks off an rsync that copies the datadir from the master to the slave. Better option: inotify kicks of something that doesn't have to scan whole data file on each side to find the changed data and move it.

On a failure of the master, either I would manually move a cable to make the system start using the slave, or IPs could be shuffled remotely, or one of the failover systems could do it automatically. For my particular application, it would be fine if I had to ssh in and change a DNS entry to make our web servers start using the slave database, and because I would not want the database server on the slave to start up and start writing to a disk that might also be receiving syncs from the master, I would prefer it be manual for now.

Aside from an inotify / rsync solution that seems limited and kludgy, does anyone have any tips on how to go about this ?

gaius · on March 8, 2009

You would do this with VVR, SRDF or SnapMirror, depending on who you bought your storage off. Physical standbys at the database level (a la Dataguard) are much nicer from a management perspective, because at any time you can choose to take a consistent copy by simply stopping recovery at the end of the next archivelog.

easp · on May 21, 2009

Belatedly, you probably want to check out DRBD. It can do sync and async mirroring of Linux block devices over a network. http://www.drbd.org/

With a cluster-aware filesystem like GFS, both copies can be r/w (though that might be a little touchy), otherwise it can be used for warm standby, or, in conjunction with LVM and an appropriate filesystem, you could create snapshots on the primary, and then mount them as read-only on the secondary (but this may not be very appropriate for any DB scenarios).

chiffonade · on March 8, 2009

> As expected, the NTT Open Source team working on Synch Standby ran into some unexpected issues

Is that sort of like a known unknown? Or an unknown known?

patio11 · on March 9, 2009

I think people who mock those phrases mercilessly probably never heard them used in context. They're government contractor-speak that is probably popular around the DoD, but they're not meaningless. Here, let me try.

Known known: There is a bug open in the tracker which reports that there is a race condition under certain circumstances. You have someone assigned to look into it.

Unknown known: One of the programmers noticed a race condition under certain circumstances, but he did not put it in the tracker, and as a result the information is compartmentalized and the issue is not being dealt with.

Known unknown: In preparation for your release you've got a load balancer, four application servers, and memcached set up, but you're not sure whether it will be able to handle a Slashdot on release day. You resolve to test this, when the schedule permits.

Unknown unknown: Your password recovery form permits a SQL injection attack and nobody in your organization has a clue.

The manager take away point: it is far better to have known unknowns than unknown unknowns because you can take steps to mitigate the risk, and unknown knowns are almost useless to you as a manager because they prevent you from mitigating the risk even if you're aware of a solution at some level of the organization.

chiffonade · on March 10, 2009

I think anyone who can figure out what an adjective put in front of a noun means understands what these phrases mean, the point is that they're just so fucking ridiculous as to be laughable.

discojesus · on March 8, 2009

"Hot Standby"?

Sounds like a porno movie set in an airport terminal.