Twitter, PayPal reveal database performance

dkhenry · on Oct 5, 2012

I always knew the company I work for does cool things with MySQL, but I never knew how far we really have pushed the envelope until I see how much more we push MySQL then the twitters and facebooks of the world. Doing some napkin math I would say our largest cluster of MySQL does about 320 Billion inserts a Day across 80 nodes and some of our nodes can peak at about 200K queries per second ( that would be 16 Million QPS if all the nodes were fully loaded ). Meaning at one of our installs we do the same volume as two facebooks !

ddorian43 · on Oct 5, 2012

Curious what type of servers, what type of writes, do you use compression, how you shard, what type of indexes? Write an article for highscalability and get exposure?

ericcholis · on Oct 5, 2012

Out of curiosity, what industry do you work in?

dkhenry · on Oct 5, 2012

Performance Monitoring.

aristus · on Oct 5, 2012

Coincidently I work on performance monitoring at Facebook. We gave up logging perf data to MySQL because it was a management headache at scale. So instead my colleagues built Scuba, which is essentially a search engine they taught to perform aggregate functions.

http://gigaom.com/cloud/exclusive-facebooks-scuba-project-di...

dkhenry · on Oct 5, 2012

Lots of people gave up trying to do their own performance logging, it turns out its a non trivial problem when you get to large scale systems. Which is why the company I work for exists.

aristus · on Oct 5, 2012

Does the name start with New and end with Relic, by chance? :) I'd love to compare notes sometime. My email's in my profile.

dkhenry · on Oct 5, 2012

No, but I understand why you might think that. I sent you an E-mail.

nc17 · on Oct 5, 2012

Your company is easy to find on Google. It's not stealth, and it's venture-funded. Curious why you can't or won't mention the name here.

dkhenry · on Oct 5, 2012

I am in no way an official spokesperson for them, just a developer, I don't know how much publicity I am allowed to draw to our back end architecture without getting into trouble. It is after all the thing that separates us from everyone else out there. (BTW I have posted the Who's Hiring post a couple times so my Username is pretty easy to track to whom I work for)

ericcholis · on Oct 5, 2012

Ah, that'll do it.

ericcholis · on Oct 5, 2012

Not surprising, especially if you've ever used their interface or API. I'm reasonably certain that they don't have any indexes.

Slowest API ever....

ch0wn · on Oct 5, 2012

The sandbox they provide is even worse in terms of response times.

nopassrecover · on Oct 5, 2012

And lacks documentation. Or the documentation that is there is wrong. Or it's there but missing fundamental details. Or it's right but you question how it could possibly be a good design decision.

ch0wn · on Oct 5, 2012

Or the documentation that describes something like it works in the sandbox, but not in the production system or vice versa. Developing for Paypal is a lot of fun.

eli · on Oct 5, 2012

Oh, please. I think a lot of their core backend is still in C++ so maybe we should blame C++ for the slow API?

TazeTSchnitzel · on Oct 5, 2012

Heck, Bitcoin, a decentralised peer-to-peer "currency", feels faster. Even if it takes ten minutes to confirm a transaction, and longer to be certain.

taligent · on Oct 5, 2012

Reasonably certain based on what evidence ? Because its API is slow ?

robryan · on Oct 5, 2012

The speed of their transaction search actively discourages me using it. It is extremely slow even when searching by the transaction Id which you would assume at least that would have a decent index on it.

taligent · on Oct 5, 2012

Have you EVER actually worked on a reasonably sized system ?

Because there are a lot more contributors to request/response performance than just database indexes. Especially when you are at the scale of Paypal.

drm237 · on Oct 5, 2012

I have, and I can tell you that the speed at which they are able to search for transactions is disturbing. Regardless of if it's related to indexes or another cause, it's still a huge issue for people who have to use PayPal. That appears to be the point the parent comment was trying to make.

Your comment had the potential to be constructive and educational, but instead you took a condescending approach. That's unfortunate.

robryan · on Oct 5, 2012

The point I was trying to make was that of all possible transaction searches you can make, one made on the transaction id should be the easiest one to make fast.

aidos · on Oct 5, 2012

"get a timestamp error if one node attempts to read data that was written by another node in the future (due to data moving across datelines)"

Honest question: Which part of a replication/db system would be using time stamps that weren't universal?

blrgeek · on Oct 5, 2012

And that's the part that Google Spanner gets around with Atomic clocks and GPS clocks.

donavanm · on Oct 5, 2012

Just the concept of using time for the logical clock is sketchy. It's not monotonically increasing, it drifts, it can go decrease, etc... Curious what their cross host/region ordering and consistency looks like.

sanxiyn · on Oct 5, 2012

"An atomic clock is not that expensive." -- http://research.google.com/archive/spanner.html

jrockway · on Oct 6, 2012

Indeed. A GPS clock accurate to 50ns is around $25 on eBay.

mikecb · on Oct 5, 2012

Especially since there are 24 circling the earth that send out time continuously.

andrewcooke · on Oct 5, 2012

curiously, 4629 tweets a second is 4 million a day[1], which is more than the number of rows per day (3 million).

this must be someone not thinking, right? i don't see how they can have less rows than tweets. even with de-duplication of repeated text they still need a key (and timestamp) per tweet.

oh! and it gets worse. the next line says "400 million new tweets a day". that one must be plain wrong (it's a rate 100x higher than the number per second), given that the other two are (comparatively) consistent (it would also mean an average of 3 tweets per day per active user (140 million), and i suspect they define active user to be anyone over 1 tweet per month...).

so the section "by the numbers" contains four items, but only two independent values. and appears to be inconsistent twice.

[1] almost exactly - presumably the per second value is derived from a rough figure for the daily value.

jeremycole · on Oct 5, 2012

Howdy. I was the presenter from Twitter.

Aside from your bad math above, the reporter heard million when I said billion. More than 3 billion rows per day. That math is easy enough to come by if you do any math on the rest of the numbers: 400M (Tweets per day) * 4 (replication) = 1.6B rows per day to store the Tweets, plus the same amount for an entry in a timeline. So that's 3.2B right there. And there are a lot of other types of indexes.

the_cat_kittles · on Oct 5, 2012

Thanks for clearning that up, I was very curious how you could make so few rows with so many tweets. Alas, no magic, just bad reporting.

shawabawa3 · on Oct 5, 2012

4629 (tweets per second) * 86400 (seconds per day) = 399,945,600 (400 million tweets per day)

So that checks out... I'm guessing the number of new rows per day is excluding tweets.

andrewcooke · on Oct 5, 2012

argh. you are right - i cannot do simple arithmetic. sorry. [and now it is too late to alter or delete the comment]

taligent · on Oct 5, 2012

Yes because finding ways to not insert a new row for each tweet is a classic sign of someone not thinking.

From the UI Twitter only shows tweets to the minute so you could imagine that there is an optimisation there for merging retweets with the same text within a 60 second period.

pyre · on Oct 5, 2012

Assume they normalized the timestamp, and only stored something on the level of "these people tweeted/retweeted this message sometime during this 60 second interval."

The only way to avoid adding a new row per account that tweeted/retweeted would be to store the list of account in a single row, and keep updating it on every re-tweet. This seems like it would be a less than optimal solution.

andrewcooke · on Oct 5, 2012

it certainly seems like you would need to aggregate keys somewhere (if it's not the typo i assumed). that might just be possible in postgres these days (update: a quick google suggests it's in development for 9.3), but i am pretty sure not in mysql. on the other hand, for all i know, at this scale, they may be treating mysql as some kind of distributed hashtable and doing a lot more work in higher layers...

[update: I completely screwed up the maths; looks like this is not rows per tweet, but support data.]

pyre · on Oct 5, 2012

  > for all i know, at this scale, they may be treating
  > mysql as some kind of distributed hashtable and doing
  > a lot more work in higher layers...

If MySQL is just used as hash table, and all of the interesting stuff happens at higher layers, then talking about the number of rows isn't that meaningful.

gogetter · on Oct 5, 2012

Twitter, Paypal don't really need to be that fast. Nor does Facebook.

But do you think major banks on Wall Street tracking billions of trades a day in near real-time would be satifisfied with these numbers?

It all depends on what your users demand and what they will put up with.

spudlyo · on Oct 5, 2012

If anyone is interested in watching Jeremy's presentation online it can be found at:

http://medianetwork.oracle.com/video/player/1873920417001

keyz · on Oct 5, 2012

What I would really like to see is a guesstimate of the volume of user behavior/analytics data that FB/Twitter generate. It would be interesting to see the average amount of data per user-interaction.

VMG · on Oct 5, 2012

That's surprising to me. I didn't expect financial transactions to create that much data. Also wouldn't most of the data be archived and only rarely be accessed?

Firehed · on Oct 5, 2012

Financial transactions tend to create a lot of data for the logging and recoverability they require, and the logging isn't terribly interesting.

The trick is realizing that things can and do fail in any way for any reason, and being able to automatically recover from any point in that process. Our (WePay's) system currently requires, at an absolute minimum, seven inserts across six tables, though more commonly 9-10 rows across seven, plus updates on several others, and significantly more if we can't authorize the payment on the first attempt (e.g., mistyped ZIP code on your billing address). Only one of those tables is for logging (account_history, which is functionally identical to a check ledger).

I've completely ignored both the logging of state changes (largely redundant at this point, it was more for early debugging a couple years ago, but still sometimes useful for figuring out why something stalled) and recording of fraud-detection data which can easily be a hundred rows, albeit very small ones (80 bytes or so). It's also interesting to look at what updates need to be performed in a transaction and which ones don't, although that's of course irrelevant to the actual amount of data produced.

That doesn't account for the product side of payments at all - recording contents of shopping carts, donations, ticket purchases, etc. That's at best one stateless row, but use your imagination about the data layout for various money-collecting tools.

Archival is interesting and something I'm sure we'll look at more in the future, but right now it would tend to create more problems than it solves -- we have automated data integrity checks running multiple times per day to ensure nothing is out of line, and kicking data out to an archive somewhere would complicate that significantly. We also of course don't have nearly as much data as PayPal being significantly newer, so it's less of a problem.

conflicts/row locks being one of the more common, but that's easy to deal with. It's when an external call to a processor dies halfway through that things get tricky.

ericcholis · on Oct 5, 2012

In some cases they store more data than just transaction information. Some people are using PayPal to manage their entire shopping cart. Meaning, that they store basic item information for orders.

conradfr · on Oct 5, 2012

Every action you do on your PayPal account is logged, that must account for something.

trustfundbaby · on Oct 5, 2012

It'd be great to read about people's experiences using postgres on a comparable scale to that discussed in this article ... does that even happen?

d0ugal · on Oct 5, 2012

Yahoo had a huge postgres database a few years ago. No idea what they are doing now. http://www.computerworld.com/s/article/9087918/Size_matters_...

And Skype sounds pretty interesting too. http://www.slideshare.net/adorepump/small-overview-of-skype-...

hendler · on Oct 5, 2012

Does the "commercial instance of MySQL" refer to Percona, something else, or is it simply the MySQL commercial license?

jeremycole · on Oct 5, 2012

The article author is confused. I clearly stated in the presentation that we're developing our own branch of MySQL, which is available on Github:

https://github.com/twitter/mysql