I don't want to give away too much detail - Company IP, self identifying, etc.
However, let's just say that if normal peak traffic is 1X, then bot traffic can come in at 100-10000X. And in most cases it's not a smooth rise; it'll be a case of all this new traffic suddenly arriving, with no warning and no ramp-up.
How many non-trivially complex sites do you know that can suddenly take 100-10000X their normal peak in under a minute? Those that do - are their managers happy about spending on idle infrastructure?
How fast do you think even services like AWS can scale? (hint: it's not fast enough).
Normal traffic is, in aggregate, quite predictable in load. Outside of very unusual situations, you don't get hundreds of thousands of people suddenly hitting your site.
Plus, because that traffic is nice and predictable, you can have decent caching rules, reducing the cost of the traffic again.
Along comes Mr Botnet and they want to scrape everything: every hotel in every city, every checkin/out tuple for stays of 1-7 days for the next 12 months, and a bunch of room configurations.
Kiss your caching goodbye, because now not only is your cache hit ratio for their searches going to be effectively zero, but now it's being filled with other shit that nobody else wants.
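To put rough numbers on why the cache stops helping (the figures below are purely illustrative, not our real ones): each distinct search tuple is its own cache key, and a full scrape touches nearly all of them exactly once.

```python
# Back-of-the-envelope sketch (illustrative numbers only) of why a
# "scrape everything" crawl makes the cache useless: every distinct search
# tuple is its own cache key, and the scraper hits each key roughly once.

hotels_per_city = 2_000        # assumed: a large city
checkin_days = 365             # next 12 months of check-in dates
stay_lengths = 7               # stays of 1-7 nights
room_configs = 5               # assumed: a handful of occupancy/room setups

distinct_tuples = hotels_per_city * checkin_days * stay_lengths * room_configs
print(f"distinct cache keys for one city: {distinct_tuples:,}")
# => 25,550,000 keys, each requested about once, so the hit ratio on the
# scraper's own traffic is ~0, and those one-off entries evict the small set
# of popular searches that real users actually repeat.
```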
And this is just our infrastructure. There are also the third-party OTAs (online travel agencies) that we hit, and they have the same issue, often with smaller and less experienced teams running on crappier infrastructure.
So, you get angry calls from them that their shared hosting provider is cutting them off and you're ruining their business. Because of course, this shared hosting is not only hosting their room availability search API, but their hotel check-in and sales applications - so nobody in 30 hotels can check-in/out.
This wouldn’t be a problem if these companies offered an API to access their data. But of course, that wouldn’t be good for a business which ultimately depends on information asymmetry and walled garden lock-in. Personally I’m not a big fan of these business models (see LinkedIn for another example), but I at least understand where they’re coming from. I also understand it’s sometimes out of your hands, i.e. the hotel partners don’t want you to make the data easily available. Similar incentive structures exist in flight booking and geoblocked Netflix content.
FWIW, I’ve been on the other side of this (not in the travel industry), and written scrapers. I never scraped the HTML, though. My strategy was to MITM the mobile app and use the same API. I also made sure to throttle the traffic, if not to be respectful, at least to blend in without setting off alarms...
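Roughly, that kind of throttled client looks like the sketch below. The endpoint, headers, and rate are invented placeholders (in practice they'd come from watching the app's own traffic); the point is just the pacing.

```python
# Minimal sketch of "use the mobile app's API, but throttle" - hypothetical
# endpoint and parameters. Fixed pacing keeps the traffic looking like one
# slow client rather than a burst that sets off alarms.
import time
import requests

API = "https://api.example-travel-app.com/v1/rates"   # hypothetical endpoint
REQUESTS_PER_MINUTE = 10                               # stay well under normal client rates

def fetch_rates(hotel_id: str, checkin: str, checkout: str) -> dict:
    resp = requests.get(
        API,
        params={"hotel": hotel_id, "checkin": checkin, "checkout": checkout},
        headers={"User-Agent": "ExampleTravelApp/4.2 (iPhone; iOS 16.0)"},  # mimic the app
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def crawl(queries):
    delay = 60.0 / REQUESTS_PER_MINUTE
    for hotel_id, checkin, checkout in queries:
        yield fetch_rates(hotel_id, checkin, checkout)
        time.sleep(delay)   # fixed pacing; jitter could be added to blend in further
```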
My intent was not to attack you, but to respond to your assumptions about the business model, and your comparisons to geoblocking and LinkedIn's anti-crawling stuff.
There were APIs. I won't go into the business relationships that existed over their use.
We also had problems with scrapers using our data. We contacted lawyers and got advice to plant extra messages in the data to gather evidence. The lawyers then took care of it, and the problems have slowly gone away. The cache now works much better. It's not fun when you spend a lot of time and money to accumulate data and someone thinks they should have it for free.
If you put the data on the internet, and it’s accessible at a public address, it’s free. If you don’t want people (or robots) to access it, don’t make it public.
You might be interested in the case of LinkedIn vs HiQ [0], which is setting precedent for protecting scraping of public data.
Based on the fact that you “inserted special messages,” it sounds like the people scraping your site may have been republishing the data. That is a separate issue that in some cases can violate copyright. But in that case, it’s not the scraping of the data that is the problem, so much as it is republishing the data outside the bounds of fair use.
I am of the strong belief that if you make your data publicly available to users, you should expect bots to scrape it too. If your infrastructure is set up in a way that makes traffic from those bots expensive, that's your problem. The solution is not to sue people or send them letters. You can mitigate it with infrastructure changes like aggressive caching, or you can charge for access for everyone, not just bots. IMO, it's especially wrong if you allow Google to scrape your data but try to stop every other bot from doing the same.
> You can mitigate it with infrastructure changes like aggressive caching
Rate data has a very limited validity period. Customers get super super pissed (and assume you're scamming them) when they click through and find that the hotel/flight/whatever you'd listed at $200 on the previous page is now either $250 or sold out. Customers, and the local authorities, also tend to get lawyers involved if it happens (in their eyes) too frequently without a good explanation.
It's expensive to get that rate data, because unless you have your own inventory, you have to go out to third-party APIs to request the rate for each search-parameter tuple, which includes specific check-in/check-out dates. When you're searching larger cities - where you might have thousands of hotels - that can be an insanely large number of API calls just to return rates.
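To make that concrete, here's a toy sketch (names, TTL, and structure invented, not our real system) of how the short validity window and the per-hotel fan-out interact: rates can only sit in cache for a few minutes, and every miss is a paid upstream call.

```python
# Toy sketch: rates are only cacheable for a short window (or customers see
# stale prices), and each cache miss for a city search fans out into one
# upstream third-party call per hotel. A scraper walking every date tuple
# hits cold keys almost every time.
import time

RATE_TTL_SECONDS = 300          # assumed: rates treated as valid for ~5 minutes
_cache: dict[tuple, tuple[float, dict]] = {}

def get_rate(provider, hotel_id: str, checkin: str, checkout: str) -> dict:
    key = (hotel_id, checkin, checkout)
    now = time.time()
    hit = _cache.get(key)
    if hit and now - hit[0] < RATE_TTL_SECONDS:
        return hit[1]                                           # still fresh enough to show
    rate = provider.fetch_rate(hotel_id, checkin, checkout)     # paid third-party API call
    _cache[key] = (now, rate)
    return rate

def search_city(provider, hotel_ids, checkin: str, checkout: str):
    # One city search = one upstream call per hotel that isn't freshly cached,
    # so a city with thousands of hotels costs thousands of upstream calls
    # whenever the tuples are cold - which, for a scraper, is nearly always.
    return {h: get_rate(provider, h, checkin, checkout) for h in hotel_ids}
```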
Most places (including my former employer) don't have a problem with scrapers, so long as they don't abuse the platform to the point that it causes a ton of extra load. When you have someone who spins up huge numbers of connections at once, that's when we have to do something about it.
> you can charge for access for everyone
That's implicit in the purchase process.
It's like if there's a little cafe that provides free water and tables to sit at on their balcony. That works out for them because it attracts customers. Not everyone might buy something, but most do.
Then someone who runs a dog walking business decides to make that a stop on their walk with 20 dogs. Their dogs eat all the treats, run around the balcony, while the walker sits at the table and drinks the water. Meanwhile, customers are annoyed that there's now 20 barking dogs running around and so they leave.
The business is well within its rights to tell the dog walker to leave and not return, without having to also block everyone else who isn't abusing the system.
A fair number of apps use cert pinning, though I'm not sure of the percentages. It's easy to circumvent if you have a jailbroken device. I haven't done this in a few years, but there used to be something called SSLKillSwitch for jailbroken iOS which would hook the HTTP request methods to remove the cert pinning.
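For anyone unfamiliar, the pinning check itself is conceptually simple - roughly the sketch below (host and fingerprint are placeholders) - and hooks like that work by making the comparison always pass.

```python
# Rough sketch of what certificate pinning boils down to: compare a hash of
# the server's certificate against a value baked into the app, and refuse to
# talk if it doesn't match. Host and pinned hash are placeholders.
import hashlib
import socket
import ssl

HOST = "api.example-travel-app.com"                      # hypothetical host
PINNED_SHA256 = "<expected sha256 hex of the server cert>"  # placeholder

def cert_matches_pin(host: str = HOST, port: int = 443) -> bool:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der_cert = tls.getpeercert(binary_form=True)  # raw DER-encoded certificate
    fingerprint = hashlib.sha256(der_cert).hexdigest()
    return fingerprint == PINNED_SHA256
```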
Thanks for the detail. It's just mind-boggling to me that you could have a peak of (for example) 1K requests per second, and then a bot raises that to 10M requests per second.
If nothing else, it seems incredible that bot authors would be so willfully harmful - they must know that kind of behaviour is going to prompt a reaction.