I currently work for a company that uses Aerospike quite heavily. In the past co...

jboggan · on May 6, 2015

What is the rationale for storing monetary values in this sort of system? Not being snarky, just legitimately curious what scale of service could possibly necessitate that and what solutions didn't work beforehand.

CyrusL · on May 6, 2015

Transactions in AdTech are different than normal payments.

For example, imagine an ad campaign spending $30k/month at a rate of $5 per 1,000 impressions. The customer may want their budget spread evenly throughout the month, so the software sets a daily budget of $1000. But this really represents 200,000 daily impressions, each of which is a transaction that subtracts from the available balance in real-time. The buyers software is talking to an ad exchange and keeping track of the budget every time an individual impression is won.

To add some more complexity, the impressions are probably billed as second-price auctions, so they aren't all exactly $0.005 each. Some are $0.00493, some are $0.00471, ect. Each one of these numbers is reported back from the exchange to the buyer's software in real time and the buyer is responsible for managing their budget.

This is just an example, but hopefully it illustrates how it can become impractical to account for this kind of thing using something more traditional like PostgreSQL. It would be reasonable to log all the impressions to something like Hadoop for the analytical piece of the software, but there needs to be something more real-time for budgeting to prevent overspending. The big ad exchanges can host hundreds of thousands or even millions of auctions per second, so not turning off bidding can be very costly.

This process of auctioning ad impressions across many buyers through an API is called real-time bidding.

hurin · on May 6, 2015

Why does this need to be in real-time, if their daily budget is $1000, you can still wait quite a bit and then apply increments in aggregate (e.g. hourly)? More so it sounds the customers aren't inter-connected - it hardly seems like the complex distributed problem.

cldellow · on May 6, 2015

The impressions may be sparse, e.g. say you're retargeting CEOs (demographic information you're getting from a DSP) who have visited your website in the last month (via a pixel you drop) who are in New York City (via a geoIP DB).

So, fine, a probabilistic model might work well. And you might decide to bid on 100% of impressions. And you might decide that you have to bid $200 CPM to win -- which you're OK doing, because they're sparse.

And then say that FooConf happens in NYC and your aggressive $200 CPM bid 100% of the time blows out your budget.

Often you can charge the customer you're acting on behalf of your actual spend + X% up to their campaign threshold. So you really want to ensure that you spend as much as possible, without spending too much. Pacing is hard. Google AdWords, for example, only promise to hit your budget +/- 20% over a 1 month period.

hurin · on May 6, 2015

I'm really not seeing what you gain out of running fine-grained control all-the-time here. Even if it were vital for a customer that you hit a budget target exactly you could dynamically change the granularity of control as you got closer. If anything predictive modeling would give you better budget use when you do have the flexibility than granular adjustment would (I don't know much about the area though and just going with your description of the problem here.)

phamilton · on May 6, 2015

A better example is frequency capping. Ever watch something on Hulu and see the same ad 4 times in a twenty-minute commercial? Or even, worse, back to back?

With a real-time data stack you can avoid the duplicated ad a good percent of the time. Better experience for buyers, for publishers, and for users.

jmalicki · on May 6, 2015

Or you could just store that in a cookie.

phamilton · on May 6, 2015

Store what in the cookie? Every ad the user has seen across the entire web, along with how many times they've seen it in the last N minutes/hours/days/weeks?

A cookie won't fit all that data and a more traditional database generally won't work. In memory k/v stores like Redis won't work due to data size (TBs of data). Hbase/Cassandra/etc sort of work with latency in the 5ms range. That's fairly expensive in a 90ms SLA, but you can make it work. It does limit the amount of work you are able to do.

vtuulos · on May 6, 2015

We (Adroll) have been very happy with DynamoDB for use cases like this. Works fine with ~500B keys, while maintaining very low, and consistent, latencies.

nkohari · on May 6, 2015

+1 -- we also used DynamoDB at Adzerk for the same use case.

easytiger · on May 6, 2015

The alst 4 ads

phamilton · on May 6, 2015

Even then we need a fast lookup. 4 ads means 4 advertisers, 4 creative, 4 potentisl clicks, imps, mids/ends (video), conversions/ etc. There are also multiple cookies depending on which exchange the ad came from. Having to map them on the fly requires a fast and large datastore.

hurin · on May 6, 2015

> A better example is frequency capping. Ever watch something on Hulu and see the same ad 4 times in a twenty-minute commercial? Or even, worse, back to back?

Yeah, but when that happens I usually don't think, oh hey they are lacking an optimal in memory distributed database solution.

I think, well... their engineers suck. Or they don't care. Pick one.

edit: His point is vague, so there is nothing technical to respond to. I am very much interested in a good technical example - but the things mentioned so far are by all appearances relatively straight-forward and linear, hence lack of effort or bad engineering are the only reasonable assumptions left.

cwyers · on May 6, 2015

I don't get your point here. He's explaining why it works the way it does. You're saying you don't think about it as a user. That doesn't invalidate or even respond to his point.

phamilton · on May 6, 2015

Volume and latency requirements make it more difficult to track individuals on the web. It's an easier problem to solve in 50ms. It's also much easier to solve when it's only a million individuals rather than a couple hundred million individuals.

Like most problems, scale makes it hard.

hurin · on May 6, 2015

It's simply not a difficult problem: there are no consensus requirements between individuals, so scaling can't be made any harder by increasing N.

cldellow · on May 6, 2015

phamilton touched on another good example - the budget can be implicitly set per-user via a frequency cap. If you see the user and have a chance to bid on an impression for them, the odds are good you'll see them again in two seconds -- winning two auctions means you've overspent by 100%. Oops.

minaguib · on May 6, 2015

Perhaps also worth pointing out that in RTB, quite often you won't know if you've actually won the auction or not for a few seconds later (when the exchange calls you back with a win notification, or you get it directly via a pingback from the user's device)

During that delay you might have actually already processed new bid requests (auctions) for the same user.

Depending on the order's characteristics and how much you're willing to deviate from target - especially when observed within a small time window - the above poses additional challenges w.r.t. overspending.

jitix · on May 6, 2015

IN such scenario, 'near' realtime would work just fine. Just process the impressions through storm or spark and put the results in HBase (CP type store) or even PostgreSQL.

jboggan · on May 6, 2015

Fascinating, thank you.

DrJosiah · on May 7, 2015

Having stored money values in Redis several times in the past (sometimes without replicas at all!), the answer is knowing how much you can trust the system.

I trust enough to get the job done, but not enough to get bitten when these systems drop data. Because here's the truth: they all drop data.

obituary_latte · on May 5, 2015

Can you add any detail to this anecdote? It's interesting and important, but detail might help steer others appropriately. What kind of inconsistency? What kind of fluctuation?

Ankhers · on May 5, 2015

I'm current working in AdTech. We are using counters to keep track of a lot of things that are important to us (e.g., reasons for not bidding on a given campaign, money spent within a given transaction, number of bid requests we get by exchange, etc). I personally have found two different, yet very similar, data fluctuations.

The first I found when debugging an issue. I noticed the counters going up, dip down, then continuing up. Rinse repeat. (e.g., 158 -> 160 -> 158 -> 170 -> 175 -> 173 -> 180)

The second I found was when trying to debugging the previous issue. I noticed the counters were essentially cycling. (e.g., 158 -> 160 -> 170 -> 158 -> 160 -> 170). This just repeated for the duration we watched the counters (approximately five minutes).

Please note that I used small numbers here. The counters I was monitoring were in the hundred millions, and I saw decrements in average between 2-3k.

justinsb · on May 6, 2015

Isn't this exactly what the article found? (search for split-brain)

linc01n · on May 6, 2015

I work in AdTech too. I'm still looking for a perfect counter solution. The counter we are using always overrun (which is better than up/down/up/down). We still manage to hack it by patching the number periodically.

p.s. I am working in an Ad Network but not plugging into exchange. Our system is not capable for that.

annnnd · on May 6, 2015

Genuinly curious: did you try implementing realtime stream processing solution, using Storm or similar?

jhugg · on May 6, 2015

VoltDB is pretty fantastic at counters. (VoltDB Engineer)

Rapzid · on May 6, 2015

Some of this stuff looks to me, as somebody not familiar with the domain of course, as a really good use case for event sourcing and in particular something Kafka/Samza could tackle well.

For future consideration and delayed evaluation of course. I guess if you absolutely must have the most up-to-date information so you can make decisions on it RIGHT NOW that wouldn't work very well :|

Or would it? If your bids and stuff are also going through the event stream..

SystemOut · on May 6, 2015

Bids generally have to be responded to within 100ms on these AdTech stacks and if you start to timeout they slow down bids being sent to you....which makes your bid manager less desirable by advertisers. You can probably use stream processing to build and update your models but I'd be surprised if you could handle the bid responses through the same mechanism.

23david · on May 6, 2015

Very strange. I've never encountered something like that with any database or cache. Wonder if it's somehow related to the way that your cluster is setup?

How big is this cluster? Are you writing and reading to the entire cluster, or do you have certain nodes that you write to and others that you read from?

bbq · on May 6, 2015

The analysis in the article shows that Aerospike is designed, intentionally or not, as a loosely accurate data store. It doesn't matter how you set it up or use it.

jamshid · on May 6, 2015

You use distributed databases and have never encountered an inconsistency? What scale? What are you using, I guess not any of these: https://aphyr.com/tags/jepsen.

Ankhers · on May 6, 2015

I have to admit, I'm not 100% on the entire configuration. However...

We have two clusters of 8 nodes each. Each cluster is setup with 2 factor replication. The clusters are setup with cross datacenter replication.

Your read / write question is a little hard to answer. In Aerospike, a given key will always reside on the same node, something to do with how they optimize their storage. Which means that anytime you write to, or read from, a given key your query will always be routed to the same node.

ngx1234 · on May 6, 2015

When Aerospike ships XDR batches it does not replay events, it just re-syncs the data. This is true even for increments. So if cluster A has 10 increments of n to n+10, and cluster B has 20 increments of n to n+20, it's possible XDR will ship A to B and cluster B gets set to n+10. XDR only guarantees data consistency if writes are 15 mins apart and your cross datacenter network doesn't go down.

The suggested method of solving this is to have two keys, one for each cluster, and XDR both keys. Then add them together in the app. You can maybe do it through a lua script, though I haven't tried.

cmpxchg16b · on May 6, 2015

This is interesting, and of course I have a couple of questions, but only two of them really matter: what client are you using, and are you using the cross-datacenter (XDR) replication functionality?

We* tested the increment functionality heavily (300K-1M aggregate ops/sec) before we turned it on in revenue service. We use it for a couple of different things, event counting is absolutely the major use case.

In a single-cluster world, it works phenomenally well. In a XDR world, things get a little tricky, and we had to change the way our application logic worked to compensate for it.

Any more information you can share about your use case?

*a big ad tech company that uses Aerospike heavily

digitalzombie · on May 6, 2015

What are you migrating to ?

As is away from Aerospike to... ?

Thanks, very interesting anecdote/case.

Ankhers · on May 6, 2015

We are first evaluating MongoDB. I believe the main reason behind this is we are already using Mongo in other parts of our application, so there is no additional setup when converting.

Note that nothing is set in stone. The decision to begin migrations only happened today. It is possible that we will end up using some other technology altogether, or even we find out the issues we are having with Aerospike and continue using that service.

ploxiln · on May 6, 2015

oh... dude... exact same mistake twice...

OK, let me try to be more constructive. Since accounts are independent, shard based on account (in the application, not in some magic shard-distributing layer). Treat each shard as its own cluster.

If you want super fast requests but can accept being down for an hour or two a couple times a year, a shard can be a single beefy host with a replicating slave. I'd consider either Redis or Mysql/Postgresql. Really, these old-style sql databases can be the fastest things that have the kind of consistency you need.

I've maintained a mongodb cluster configured a couple of different ways. Performance at high load and reasonable consistency is not as great as some older alternatives.

preetamjinka · on May 6, 2015

You should take a look at "Call me maybe: MongoDB" [0]

[0] https://aphyr.com/posts/284-call-me-maybe-mongodb

sans-culottes · on May 6, 2015

Not to mention "Call me maybe: Redis" [0]. Rinse and repeat for every NoSQL database out there.

Point is, @aphyr skewers everybody. Aerospike is just the flavor of the month.

[0] https://aphyr.com/posts/283-call-me-maybe-redis

jbergens · on May 6, 2015

I think FoundationDb got great scores from Aphyr, sadly it is now owned by Apple. For the others the big problem is when they promise some kind of ACID and cannot achieve it, if they were explicit with what the supported all customers could make informed decisions. Relations systems are generally bad ad horizontal scaling and can get very slow with full ACID over many servers.

masklinn · on May 6, 2015

> I think FoundationDb got great scores from Aphyr,

FoundationDB ran Jepsen internally and reported stuff[0], Kyle never worked with it.

[0] http://blog.foundationdb.com/call-me-maybe-foundationdb-vs-j... half broken now, none of the images load for me. @aphyr seems to have taken them at their word wrt testing though: https://twitter.com/aphyr/status/405017101804396546

annnnd · on May 6, 2015

Curious: how difficult is it to replicate aphyr's tests? Can we beleive independent entities posting jepsen reports on some random DB?

lmm · on May 6, 2015

I think that's a false equivalence. Aphyr is pretty positive about Riak (with the correct configuration) and Cassandra (when used for appropriate scenarios). If I was choosing a new system those are the two I'd be looking at.

woozy · on May 6, 2015

Antirez was very receptive http://antirez.com/news/55

masklinn · on May 6, 2015

Aphyr wasn't overly impressed by that response though https://aphyr.com/posts/287-asynchronous-replication-with-fa...

bluef00t · on May 6, 2015

Try HBase. At another AdTech company, we have been very happy with it. https://eng.yammer.com/call-me-maybe-hbase

annnnd · on May 6, 2015

Nooooooooooooooooooooo! Seriously, no! Use mongoDB, PostgreSQL, even flat files if you must, but HBase?

We used it in production about 3-4 years ago and it was a nightmare from both usage and especially maintenance point. Fortunately we had a flat-files based backup system so we were able to rescue data every! Single! Time! the damn thing crashed and took (part of) data with it.

Of course, this is anecdotal evidence, and things might have changed from then, but I wouldn't touch it. Life is too short.

EDIT: Also, I am curious how the results in the above link would compare to aphyr's if he performed the test on HBase?

bluef00t · on May 6, 2015

I see where you are coming from. HBase was unstable 3-4 years ago, but after a great amount of dev effort and battle hardening from Cloudera, Salesforce, etc., it is very stable now. We have ~ 400 nodes running in production for a very critical use case and have seen 0 data loss edge cases in the last 2 years, along with some of our servers running > 6 months without any reboots.

We use is in a very real time use case with latency requirements of single digit milliseconds, and if you tweak it the right way, you can the required performance from it, along with easy horizontal scaling.

Also, I am curious too for aphyr to take on HBase, but I don't think the result would be different since running Jepsen is straightfoward and not much to a person's interpretation. The results and further experiments are what aphyr does nicely.

annnnd · on May 7, 2015

Thanks for the info on HBase stability. I probably won't use it again (once burnt...), but if they really managed to pull their act together - good for them!

perlgeek · on May 6, 2015

When storing financial data, I'd certainly go with some kind of event sourcing: store deltas / financial transactions, not counters. The counters are just a sum over all deltas.

If performance is an issue, you can make the counters available in a second database that's only for reading, and updated from the original deltas.

obstinate · on May 6, 2015

It seems from this article that Aerospike can lose acknowledged writes, so that would not be enough to save them.

drvsrinivasan · on May 7, 2015

Today, the inconsistencies here were diagnosed by company and Aerospike to be caused by two clusters connected with XDR concurrently writing data to the same counter and shipping to each other and intentionally overwriting some data (bad design that somehow slipped through the cracks). So, this issue is unrelated to the Jepsen network partitioning tests that is the subject of the original article. The work that @Aphyr is doing is very valuable and much appreciated. (Aerospike Founder)

smt88 · on May 5, 2015

Yikes. What made you choose Aerospike in the first place?

Ankhers · on May 5, 2015

Sadly, a requirement of the application is that it needs to be backward compatible with the previous system.

I'm sure many of you know, this leads to quite a few issues.