Dribbble offline due to Postgres problem (twitter.com/dribbble)
100 points by ericras on May 15, 2013 | hide | past | favorite | 46 comments


This looks like disk page corruption.

In such a case, the very first thing you do — aside from verifying that your backups, replication, and WAL archiving are all working — before you even try diagnosing your problem any further than "something weird is happening" is make a filesystem-level copy of your PostgreSQL cluster. (If you're running LVM, ZFS, a SAN, or any other thing that lets you take an instantaneous filesystem snapshot, so much the better. Do that, and then copy it.) Then, and only then, should you even contemplate trying to un-fuck your database.

I cannot stress this point enough. That means make a tarball, or cp the directory, or rsync it, or whatever file-level tool you prefer, and specifically and emphatically not pg_dump. (pg_dump probably won't make things any worse, but if you do have corrupted disk pages, it isn't likely to work anyway.)
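In concrete terms, the file-level copy might look something like this (paths are examples; stop Postgres, or snapshot the filesystem, before copying so the copy is consistent):

```shell
# Stop the cluster so the data directory is quiescent
pg_ctl -D /var/lib/postgresql/9.2/main stop -m fast

# File-level copy of the entire cluster directory; pg_dump is not involved
rsync -a /var/lib/postgresql/9.2/main/ /backups/pgdata-before-surgery/
```

Only once that copy exists should any repair attempt start.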

Flailing around trying to fix things can sometimes make them worse. If you're working on your already broken data, and break it further without the safety net of a fs-level backup, you're ... well, you're worse off than you were five minutes prior, aren't you?


Better, if you do have backups, just drop your database, reconstitute a new one (using the schema definition file you should have saved), and load your most recent backup. I've dealt with disk-file-level corruption, and the best way to deal with it, IMHO, is not to deal with it at all.
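A sketch of that route; the database and file names here are made up:

```shell
# Drop the corrupted database and start clean
dropdb broken_app
createdb app

# Recreate the schema from the definition file you kept
psql app -f schema.sql

# Load the most recent logical backup
psql app -f latest_backup.sql
```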


^^This

Still true today. As I say: back up, don't fuck up. Plan for the worst and let Murphy's law work in your favour.


Since "dribbble.com" appears to be "taking a timeout for maintenance" I can't at a glance determine what, exactly, they do. So maybe they're a bunch of hobbyists. But if not ...

I don't understand how a "real" company, even in today's overheated environment of soon-to-fail hipster startups, doesn't do at least one of these two things: a) have competent employees on staff who are intimately familiar with their critical infrastructure and how to support it, or b) pay, yes, gasp, pay some other company for professional support services. A glance at the Wikipedia entry for Postgres shows some possible companies that do that. Or, as already mentioned, there are mailing lists.

But twitter? Really? That's support? For someone other than a hobbyist running a website in their spare time?

Go ahead, flame me. But my first impression is "amateur hour". My apologies if dribbble.com really is an amateur effort.


Even people intimately familiar with a piece of infrastructure - hardware or software - can't possibly know all failure modes. Asking on Twitter about an obscure error you've never seen costs nothing and may well be just one way of finding a fix. It's akin to asking on IRC or writing to a mailing list - something you seem to condone. I fail to see how it's any less professional.


>It's akin to asking on IRC or writing a mail to a mailing list

No, it isn't. It's akin to shouting at random people on the street, "Help, I don't know what to do!" Asking on IRC or on a mailing list means going to a group of people who are there specifically to help with problems on that subject.


If I had the reasonable expectation to find somebody qualified willing to help on the street with practically no effort I'd go that route as well. It's just very time consuming to do so, so that's why I usually don't.


That's precisely the point, though: it isn't a reasonable expectation. The odds of a PostgreSQL expert just happening to follow them on Twitter are poor. The odds of multiple PostgreSQL experts being on the PostgreSQL mailing list are quite high.


Sure, the chances of finding an expert are low, but you'd be surprised how well it works. See, the tweet was posted to HN by someone and the chances of finding a Postgres expert here are significantly higher. More important is that the costs are low, even lower than writing to a ML. So why not try?


Plenty of big sites use Twitter to inform users of downtime. I prefer that: do you expect me to go hunting for their status page, or should it just come up conveniently in my news feed?

Just google "reddit status" and check the first result.


It's the asking for help on Twitter he's criticizing, not the fact that they notify about the down time on Twitter.


Who's to say they aren't? Spending 10 seconds asking on Twitter, and getting an 'oh yeah, I had that and fixed it with x, y and z' in response sure beats staying up until 2am fixing it.


Also, the tweet posts the error without the Postgres version, operating system, or any other relevant details that could help the community diagnose the issue more quickly.

https://twitter.com/samuelfine/status/334499315949133825


Everyone should be running pg_basebackup + pg_receivexlog on a separate machine, preferably at a different data center.

These let you go back to any point in time. If you ran 'delete from orders where id=id', you can restore to the transaction before you ran that command.
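Roughly, the setup on the backup machine could look like this (the host, user, and paths are assumptions):

```shell
# One-off base backup of the whole cluster, streaming WAL alongside it
pg_basebackup -h db.example.com -U replica -D /backups/base -X stream

# Keep streaming WAL continuously so you can replay to any point in time
pg_receivexlog -h db.example.com -U replica -D /backups/wal
```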

http://www.packtpub.com/how-to-postgresql-backup-and-restore... contains more information.

Also, PostgreSQL 9.3 (out in a few months) supports data page checksums, which can detect on-disk corruption as soon as a damaged page is read.


I'm using https://github.com/wal-e/wal-e which ships the logs to S3 with good results
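For reference, a minimal wal-e hookup looks roughly like this; the envdir path (holding the AWS credentials) and the data directory are whatever you've configured:

```shell
# postgresql.conf: push each completed WAL segment to S3
#   archive_mode = on
#   archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'

# Periodic full base backup to S3
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.2/main
```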


That's good as well!

If you are hosting your database on ec2, using only wal-e means that all of your data is hosted on amazon. If they were to cancel your service, you'd lose your data.

Running pg_basebackup+pg_receivexlog on a different provider is cheap insurance against that.

Be sure to test how fast wal-e can restore your data, btw. Restoring from s3 was significantly slower than restoring from a local disk (in my testing a while ago).


I ship wal logs to s3 for backup, and I do test restores on EC2.

What I've found is that most of the files can be grabbed really quickly, in the 3mb/sec range, but there's always a handful that run at 300k/sec. Running the downloads 8 or more at a time tends to help with that so that we're mostly maxing out the local network.
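The parallel-download trick is just fan-out over the segment list. Here's a self-contained sketch, with cp standing in for the actual S3 download tool and throwaway paths:

```shell
# Build a fake "bucket" of WAL segments to fetch
mkdir -p /tmp/wal_src /tmp/wal_restore
for f in seg1 seg2 seg3; do echo data > "/tmp/wal_src/$f"; done

# Fetch 8 at a time so a few slow transfers don't dominate the restore
ls /tmp/wal_src | xargs -P 8 -n 1 -I{} cp "/tmp/wal_src/{}" "/tmp/wal_restore/{}"
```

With a real restore you'd swap cp for s3cmd, wal-e, or whatever fetches your segments.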


We use Barman, which is similar, using rsync to ship WAL files to a different server.

http://www.pgbarman.org/


The error in question is raised at http://doxygen.postgresql.org/sequence_8c_source.html#l01083

My guess is hard drive failure. Hope they have good backups.


And if they don't?


then they have learned an important lesson: RAID or replication isn't a backup


It's not a backup because it will happily propagate data corruption, making all drives worthless. The only backup-like protection it provides is against sudden drive failure.


Replication and RAID aren't a backup, because what if I do this in MySQL:

mysql> drop database production;

No way RAID/replication will save you then. The same goes for a rogue DELETE statement in the code that deletes too much data.

RAID isn't a good backup because groups of drives are known to die at once, or around the same time.


You can't drop a database on postgres if anyone is connected to it, which is likely the case on production.


TRUNCATE TABLE users; should work and will have similar disastrous effects.


You can revoke truncate from public.
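For example (the database name here is made up):

```shell
psql appdb -c 'REVOKE TRUNCATE ON TABLE users FROM PUBLIC;'
```

That still leaves the table owner and superusers able to truncate, of course.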


Oh, sure. I'll just do "DELETE FROM users" and forget about the WHERE clause. Same effect.


Replication on Postgres can be the basis of a backup, if and only if you store base backups and WAL logs.

In normal life, your secondaries just chug along, consuming WAL logs.

Should you do something like drop/truncate a table, you can start with your base backup and replay the logs till just before your fat finger.
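In 9.2-era terms, that replay is driven by a recovery.conf dropped into a fresh copy of the base backup; the paths and the target time here are made up:

```
# recovery.conf
restore_command = 'cp /backups/wal/%f %p'
recovery_target_time = '2013-05-15 09:40:00'
```

Postgres then replays WAL up to that timestamp and stops, just short of the fat finger.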


If it's a disk failure, then replication is the best backup ever.


Can you further explain why raid/replication isn't a backup? Sure, you can't backup beyond what you've replicated (locally or across a network), but for most folks, that should be fine.


Trolling, I assume? Just in case not: your own oh-shit moments (forgot that WHERE clause?), coding errors and oversights, bad luck, adversarial SQL injection because the contractor who wrote some obscure admin page two years ago didn't know what he was doing, customer screw-ups (your web app's customer fat-fingers their own critical data out of existence and begs you to roll them back to yesterday), etc.

If you have a real business you need a multi-generational backup scheme of some kind.


What happens if you accidentally run "delete from orders"? Or drop the database? Or need to access data that was updated in place last week?


"Backup" means, more specifically, multiple backups at various points in time.


RAID and replication are good for protecting against hardware failure, but hardware failure actually accounts for only a handful of the reasons you want a backup. RAID, replication, and backups are all solutions to problems. Once you look at the problems instead of the solutions, your answer becomes clear.

Often, the need for a backup arises from the need to go "back in time". RAID and replication offer no solution for this. Ok, so depending upon your replication setup, it may offer some help, but the steps required to go back in time usually involve starting from a recent backup, and using files involved in the replication to "replay" recent changes to get you back to the moment in the past you wish to return to.


You're kidding right? RAID is only good for dealing with HW failures, there are many more kinds of SW failure that can screw you and leave you with nothing, while having a perfectly good raid. Replication isn't a backup because if you screw up your dataset, well that screwup just got replicated.


indeed, many don't realise that, and a mistake/error replicated is still a mistake/error.

Still, at least I hope they didn't have transaction logging on the same discs. I've seen that end in horror because somebody had a large RAID array and didn't think they needed the expense of another.


We do. :)


Did they try posting about their problem on the PostgreSQL mailing lists (both pgsql-bugs and pgsql-general apply here)?

The Postgres community is quite responsive and usually gives very effective advice, as long as you use the proper channel to communicate.


Also, #postgres on Freenode. No matter when or what you ask, one of the regulars will usually respond instantly with very high-quality advice. No product, free software or not, has ever matched this kind of support quality for me.


I feel for you guys, but it sounds like a disk error. If you're not experts, you need to do one of the following: a) roll back to a snapshot, or b) restore from backup. If you have neither of those, take a SQL dump and pray you can recreate the broken data.


Might as well comment here as tweets to the account seem to do nothing. Every 25th click or so has always resulted in a Chrome "no content" error. It's a frequent problem, and it's been happening for years.


You should probably email support@dribbble.com. Twitter isn't an "official" support channel, so your tweets are far less likely to get noticed.

In the interest of saving time: browser / version / OS? Any browser extensions installed?


I also have this problem intermittently - using Chrome v26, Mountain Lion.


Agreed, but as a follower I see how active they are and how often they respond to users.

For clarity: Mac and Windows Chrome sans extensions, but these days I'm running a few like Adblock and Chime.


[deleted]


When a problem first rears its head, we usually don't know what the actual cause is. Some problems take 5 minutes to resolve, and others keep me up until 1 in the morning. So making any kind of specific statement in the middle of a crisis is often impossible, and always imprudent.


It's back, but I think one day of data is lost!



