What is the rationale for storing monetary values in this sort of system? Not being snarky, just legitimately curious what scale of service could possibly necessitate that and what solutions didn't work beforehand.
Transactions in AdTech are different from normal payments.
For example, imagine an ad campaign spending $30k/month at a rate of $5 per 1,000 impressions. The customer may want their budget spread evenly throughout the month, so the software sets a daily budget of $1,000. But that really represents 200,000 daily impressions, each of which is a transaction that subtracts from the available balance in real time. The buyer's software is talking to an ad exchange and keeping track of the budget every time an individual impression is won.
To add some more complexity, the impressions are probably billed via second-price auctions, so they aren't all exactly $0.005 each. Some are $0.00493, some are $0.00471, etc. Each one of these prices is reported back from the exchange to the buyer's software in real time, and the buyer is responsible for managing their budget.
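To make the arithmetic concrete, here's a minimal sketch of where those odd per-impression prices come from, assuming a standard second-price rule (winner pays the runner-up's bid plus a small increment; the increment value here is purely illustrative):

```python
def second_price_clearing(bids, increment=0.00001):
    """Winner pays the second-highest bid plus a small increment,
    capped at their own bid. All numbers are illustrative."""
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    ordered = sorted(bids, reverse=True)
    return min(ordered[0], ordered[1] + increment)

# A $0.005 bid can clear at $0.00493 if the runner-up bid $0.00492.
price = second_price_clearing([0.005, 0.00492])
```

So even at a nominal $5 CPM, each of the 200,000 daily impressions settles at a slightly different price, and every one of those prices has to be folded back into the running budget.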
This is just an example, but hopefully it illustrates how it can become impractical to account for this kind of thing using something more traditional like PostgreSQL. It would be reasonable to log all the impressions to something like Hadoop for the analytical piece of the software, but there needs to be something more real-time for budgeting to prevent overspending. The big ad exchanges can host hundreds of thousands or even millions of auctions per second, so failing to turn off bidding in time can be very costly.
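As a sketch of the real-time budgeting piece: the core operation is an atomic check-and-decrement per won impression. The class below is an in-process stand-in (names and structure are my own, not any real system's) for what would really be a distributed counter:

```python
import threading

class BudgetGuard:
    """Toy real-time budget guard: decrement atomically per won
    impression, and refuse spends that would exceed the budget."""

    def __init__(self, daily_budget):
        self._remaining = daily_budget
        self._lock = threading.Lock()

    def try_spend(self, clearing_price):
        """Return True if the spend was accepted, False if it would
        overspend (i.e., it's time to stop bidding)."""
        with self._lock:
            if self._remaining >= clearing_price:
                self._remaining -= clearing_price
                return True
            return False

guard = BudgetGuard(1000.00)
accepted = guard.try_spend(0.00493)  # one won impression
```

The point of doing this check synchronously, rather than reconciling hourly, is that at exchange volumes an hour of unchecked bidding can blow far past the budget.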
This process of auctioning ad impressions across many buyers through an API is called real-time bidding.
Why does this need to be in real time? If their daily budget is $1,000, you can still wait quite a bit and then apply increments in aggregate (e.g. hourly). Moreover, it sounds like the customers aren't interconnected, so it hardly seems like a complex distributed problem.
The impressions may be sparse, e.g. say you're retargeting CEOs (demographic information you're getting from a DSP) who have visited your website in the last month (via a pixel you drop) who are in New York City (via a geoIP DB).
So, fine, a probabilistic model might work well. And you might decide to bid on 100% of impressions. And you might decide that you have to bid $200 CPM to win -- which you're OK doing, because they're sparse.
And then say that FooConf happens in NYC and your aggressive $200 CPM bid 100% of the time blows out your budget.
Often you can charge the customer you're acting on behalf of your actual spend plus X%, up to their campaign threshold. So you really want to ensure that you spend as much as possible without spending too much. Pacing is hard. Google AdWords, for example, only promises to hit your budget within +/- 20% over a one-month period.
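To give a feel for what pacing can look like, here's a deliberately naive sketch: throttle the probability of bidding based on actual spend versus a linear spend target. (Real pacers use predictive traffic models rather than a straight line; this function and its parameter names are purely illustrative.)

```python
def bid_probability(spent, daily_budget, seconds_elapsed,
                    seconds_in_day=86_400):
    """Naive linear pacer: bid less often when ahead of the linear
    spend target, bid on everything when behind it."""
    if spent >= daily_budget:
        return 0.0  # hard stop: budget exhausted
    target = daily_budget * seconds_elapsed / seconds_in_day
    if spent == 0 or target <= 0:
        return 1.0  # nothing spent yet (or day just started)
    # Ahead of pace -> ratio < 1, throttle; behind pace -> cap at 1.
    return min(1.0, target / spent)
```

The difficulty the comment describes is exactly what this toy version gets wrong: a straight-line target ignores traffic spikes (like the FooConf example above), which is why overshooting or undershooting by 20% over a month is considered acceptable even by large players.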
I'm really not seeing what you gain from running fine-grained control all the time here. Even if it were vital for a customer that you hit a budget target exactly, you could dynamically increase the granularity of control as you got closer. If anything, predictive modeling would give you better budget use when you do have the flexibility than granular adjustment would. (I don't know much about the area, though, and am just going with your description of the problem here.)
A better example is frequency capping. Ever watch something on Hulu and see the same ad 4 times in a twenty-minute commercial? Or, even worse, back to back?
With a real-time data stack you can avoid the duplicated ad a good percent of the time. Better experience for buyers, for publishers, and for users.
Store what in the cookie? Every ad the user has seen across the entire web, along with how many times they've seen it in the last N minutes/hours/days/weeks?
A cookie won't fit all that data, and a more traditional database generally won't work either. In-memory k/v stores like Redis won't work due to data size (TBs of data). HBase/Cassandra/etc. sort of work, with latency in the 5ms range. That's fairly expensive within a 90ms SLA, but you can make it work. It does limit the amount of work you are able to do.
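For illustration, the lookup under discussion is roughly a sliding-window counter keyed by (user, ad). The in-memory version below is a stand-in for the distributed k/v stores mentioned above (in practice this state is TBs across a cluster, not a Python dict; the class and its parameters are my own invention):

```python
from collections import defaultdict, deque

class FrequencyCap:
    """Toy sliding-window frequency cap: allow at most `max_views`
    impressions of an ad per user within `window_seconds`."""

    def __init__(self, max_views, window_seconds):
        self.max_views = max_views
        self.window = window_seconds
        # (user_id, ad_id) -> deque of view timestamps
        self._views = defaultdict(deque)

    def allow(self, user_id, ad_id, now):
        q = self._views[(user_id, ad_id)]
        # Drop views that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_views:
            return False  # capped: don't bid for this (user, ad)
        q.append(now)
        return True
```

The catch, as the thread notes, is doing this lookup for hundreds of millions of users inside a sub-100ms bid SLA, which is what pushes it out of a cookie and into a large, fast datastore.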
We (Adroll) have been very happy with DynamoDB for use cases like this. Works fine with ~500B keys, while maintaining very low, and consistent, latencies.
Even then we need a fast lookup. 4 ads means 4 advertisers, 4 creatives, 4 potential clicks, imps, mids/ends (video), conversions, etc. There are also multiple cookies depending on which exchange the ad came from. Having to map them on the fly requires a fast and large datastore.
> A better example is frequency capping. Ever watch something on Hulu and see the same ad 4 times in a twenty-minute commercial? Or, even worse, back to back?
Yeah, but when that happens I usually don't think, oh hey they are lacking an optimal in memory distributed database solution.
I think, well... their engineers suck. Or they don't care. Pick one.
edit: His point is vague, so there is nothing technical to respond to. I am very much interested in a good technical example, but the things mentioned so far are by all appearances relatively straightforward and linear, hence lack of effort or bad engineering are the only reasonable assumptions left.
I don't get your point here. He's explaining why it works the way it does. You're saying you don't think about it as a user. That doesn't invalidate or even respond to his point.
Volume and latency requirements make it more difficult to track individuals on the web. It's an easier problem to solve in 50ms. It's also much easier to solve when it's only a million individuals rather than a couple hundred million individuals.
phamilton touched on another good example - the budget can be implicitly set per-user via a frequency cap. If you see the user and have a chance to bid on an impression for them, the odds are good you'll see them again in two seconds -- winning two auctions means you've overspent by 100%. Oops.
Perhaps it's also worth pointing out that in RTB, quite often you won't know whether you've actually won the auction until a few seconds later (when the exchange calls you back with a win notification, or you get it directly via a pingback from the user's device).
During that delay you might have actually already processed new bid requests (auctions) for the same user.
Depending on the order's characteristics and how much you're willing to deviate from target - especially when observed within a small time window - the above poses additional challenges w.r.t. overspending.
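One common way to handle that delay (a sketch of the general reserve-then-settle pattern, not a claim about any particular exchange's implementation; all identifiers here are illustrative) is to reserve budget at bid time and settle it when the win or loss notification finally arrives:

```python
import threading

class ReservingBudget:
    """Toy budget that reserves the full bid price at bid time and
    settles later, so in-flight bids can't cause overspend."""

    def __init__(self, budget):
        self._available = budget
        self._reserved = {}  # bid_id -> reserved amount
        self._lock = threading.Lock()

    def reserve(self, bid_id, bid_price):
        """Reserve the worst-case cost before bidding; refuse if the
        remaining (unreserved) budget can't cover it."""
        with self._lock:
            if self._available < bid_price:
                return False
            self._available -= bid_price
            self._reserved[bid_id] = bid_price
            return True

    def settle_win(self, bid_id, clearing_price):
        """Second-price: clearing price <= bid, so refund the excess."""
        with self._lock:
            reserved = self._reserved.pop(bid_id)
            self._available += reserved - clearing_price

    def settle_loss(self, bid_id):
        """Lost auction: release the whole reservation."""
        with self._lock:
            self._available += self._reserved.pop(bid_id)
```

Reserving the worst case makes the budget guarantee safe but pessimistic: with many bids in flight you under-bid, which is part of the tension between hitting the target and never exceeding it.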
In such a scenario, 'near' real-time would work just fine. Just process the impressions through Storm or Spark and put the results in HBase (a CP-type store) or even PostgreSQL.
Having stored monetary values in Redis several times in the past (sometimes with no replicas at all!), I'd say the answer comes down to knowing how much you can trust the system.
I trust them enough to get the job done, but not so much that I get bitten when these systems drop data. Because here's the truth: they all drop data.