Wow, that's surprising. I would have bet heavily on those messages being flat-out false. I've seen the "3 other guests are looking" message at hotels in pretty out-of-the-way places and seasons, when booking well in advance. If that message is true, I guess I just underestimate the size of the market and/or the number of hotels that people will browse before booking.
There are actually many things I hated about travel sites, and then I worked at a "travel tech" company and it changed my perception of the industry as a whole. Part of the issue is the suppliers themselves. Bcom, Expedia, Kayak, etc. are all at the mercy of what individual hoteliers and airlines report. I've seen data issues where hotels reported little/no inventory, and when we investigated, the problem was on their end (user error, oops). Or a huge party booked up the hotel, and users were angry because just 24 hours before, we showed N open rooms, and now we have almost nothing.
Issues like this happen all the time, and the debate always centered on messaging. We would love to tell users why something is unbookable, but trying to figure out the details with suppliers ended up being near impossible.
That being said, OTAs do have their own set of issues to answer for, including misleading messaging, scummy sales tactics, etc.
I’ve recently joined a travel/hospitality tech company adjacent to, but not exactly the same as, the core booking segment. What you said about investigating a lack of vacancy that traced back to the hotelier (user error), I'd suspect has to do with poor PMS integration. You probably know some of the same vendor names that I do.
For those unfamiliar, think of a hotel's PMS (property management system) like an ERP platform, but for hotel operations and for associating guest services with guest billing info. Buy the high-speed WiFi instead of using the slow free WiFi? That goes to the PMS and tells the hotel to bill you for it (if they don't farm that out to a third party who bills and remits on their behalf).
Anyhow.
I’ve come to learn, without naming any of my clients publicly, that technology integration in hospitality (I'm specifically referring to household-name hotel brands, not necessarily the brands associated with taking reservations, like Expedia and Booking.com) ranges from exceptional-but-difficult to horrifying-but-trivial. A common source of my personal frustration as a product manager for our specific services (which require integration with these PMS platforms) is incomplete or inaccurate data produced by the property or hotel brand THEMSELVES.
Totally! I agree with everything you've said; their relative lack of tech sophistication is a huge PIA and it causes us daily headaches.
In addition, I've noticed that their hardware and software are often terrible, non-performant, or buggy. So when we interface with them, we have to resort to hacks like severe rate limiting and long-term caching of inventory (sometimes 24+ hours), simply because if we ran at the TPS we normally run at, they would tip over. This leads to stale inventory on our end, and customer anger. We would love to do everything live (with short-term caching), but the suppliers just can't take the load. Just about the only direct call we make is the booking one; everything else is stale by some amount.
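To make that concrete, here's a minimal sketch of the kind of long-TTL cache we'd put in front of a supplier. Everything in it is illustrative: fetch_inventory_from_supplier stands in for the real integration, and the TTL is just the 24-hour figure mentioned above.

    import time

    CACHE_TTL_SECONDS = 24 * 60 * 60  # inventory allowed to go up to 24h stale
    _cache = {}  # (hotel_id, checkin, checkout) -> (fetched_at, rooms_available)

    def fetch_inventory_from_supplier(hotel_id, checkin, checkout):
        # Placeholder for the real (slow, fragile) supplier call.
        raise NotImplementedError

    def get_inventory(hotel_id, checkin, checkout):
        # Serve cached inventory while it's fresh enough; only hit the supplier on a miss.
        key = (hotel_id, checkin, checkout)
        hit = _cache.get(key)
        if hit is not None and time.time() - hit[0] < CACHE_TTL_SECONDS:
            return hit[1]  # possibly stale, but it keeps the supplier from tipping over
        rooms = fetch_inventory_from_supplier(hotel_id, checkin, checkout)
        _cache[key] = (time.time(), rooms)
        return rooms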
Monitoring/logging is often non-existent, so when you say "hey, we're seeing latency," they have no idea how to diagnose or fix it. Often they don't have metrics in place and are unaware that there is latency to begin with.
> incomplete or inaccurate data produced by the property or hotel brand THEMSELVES
I believe it's largely because the SOAP RPC calls that many of these integrations use are stateful, so making the same call over and over again can lead to different results. Much of their tech stack is stuck in the '90s/early 2000s, and it shows.
I believe the alerts are false; statistically speaking, they don't make sense. Every room you click on is in an end-of-the-world situation where you have to book now because otherwise "who knows", and then 5 days later the same number of rooms is available and the exact same scenario is presented for every room/hotel.
PS: What I usually do is look at booking.com for hotels at cheaper prices and then head to the hotel's own website. It usually has more categories of rooms available, and at cheaper prices.
As you might intuitively conclude, Booking and its competitors invest quite heavily in detecting real vs. bot traffic. Often, sites will block bot traffic or degrade the content. Booking is very torn on that because they don't want an arms race. When I worked on the experiment tooling, we much preferred having more bot traffic that was tagged over less that went undiscovered, because for the experiment analysis, undiscovered bot traffic was poison.
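To illustrate (with made-up session fields, not the real schema): tagged bots are trivial to drop from the analysis, while undetected bots silently dilute whatever metric you're measuring.

    def conversion_rate(sessions, variant):
        # Tagged bots can simply be filtered out; undetected bots skew the result
        # because they sit in the denominator and (almost) never convert.
        human = [s for s in sessions if s["variant"] == variant and not s["is_bot"]]
        if not human:
            return 0.0
        return sum(s["converted"] for s in human) / len(human)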
I worked for a competitor, and the amount of time+money we spent blocking scrapers was insane.
The reason we spent all that time+money? It affected the metrics on which we were paid, AND it consumed a huge amount of compute resources, both ours and our partners'.
It's also (at least in part) why places are doing the "Sign up to see special offers" thing - because the offer isn't meant for display on the open site, and they don't want that value showing up in scrapers. So... logged-in users only, and validate that the user isn't scraping.
I don't want to give away too much detail - Company IP, self identifying, etc.
However, let's just say that if normal peak traffic is 1X, then bot traffic can come in at 100-10000X. And in most cases it's not a smooth rise; it's a case of all this new traffic suddenly arriving, with no warning and no ramp-up.
How many non-trivially complex sites do you know that can suddenly take 100-10000X their normal peak in under a minute? Those that do - are their managers happy about spending on idle infrastructure?
How fast do you think even services like AWS can scale? (hint: it's not fast enough).
Normal traffic is, in aggregate, quite predictable in load. Outside of very unusual situations, you don't get hundreds of thousands of people suddenly hitting your site.
Plus, because that traffic is nice and predictable, you can have decent caching rules, reducing the cost of the traffic again.
Along comes Mr Botnet and they want to scrape everything: every hotel in every city, every checkin/out tuple for stays of 1-7 days for the next 12 months, and a bunch of room configurations.
Kiss your caching goodbye, because now not only is your cache hit ratio for their searches going to be effectively zero, but now it's being filled with other shit that nobody else wants.
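A rough back-of-the-envelope for what that scrape pattern does to the key space (all numbers hypothetical):

    hotels_in_city = 2_000      # one larger city
    checkin_days = 365          # every check-in date over the next 12 months
    stay_lengths = 7            # stays of 1-7 nights
    room_configs = 4            # occupancy / room-type combinations

    distinct_searches = hotels_in_city * checkin_days * stay_lengths * room_configs
    print(f"{distinct_searches:,} distinct cache keys for one city")  # 20,440,000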
And this is just our infrastructure. There are also the third-party OTAs that we hit, and they have the same issue, often with smaller and less experienced teams running on crappier infrastructure.
So you get angry calls from them that their shared hosting provider is cutting them off and you're ruining their business. Because of course, that shared hosting isn't just hosting their room-availability search API, but also their hotel check-in and sales applications - so nobody in 30 hotels can check in or out.
This wouldn’t be a problem if these companies offered an API to access their data. But of course, that wouldn’t be good for a business which ultimately depends on information asymmetry and walled garden lock-in. Personally I’m not a big fan of these business models (see LinkedIn for another example), but I at least understand where they’re coming from. I also understand it’s sometimes out of your hands, i.e. the hotel partners don’t want you to make the data easily available. Similar incentive structures exist in flight booking and geoblocked Netflix content.
FWIW, I’ve been on the other side of this (not in the travel industry), and written scrapers. I never scraped the HTML, though. My strategy was to MITM the mobile app and use the same API. I also made sure to throttle the traffic, if not to be respectful, at least to blend in without setting off alarms...
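The throttling itself was nothing clever; roughly this sketch, with a placeholder endpoint and delays pulled out of thin air:

    import random
    import time

    import requests  # third-party HTTP client

    API_URL = "https://api.example.com/search"  # placeholder, not the real endpoint

    def throttled_fetch(params_list, min_delay=3.0, max_delay=8.0):
        results = []
        for params in params_list:
            resp = requests.get(API_URL, params=params, timeout=10)
            resp.raise_for_status()
            results.append(resp.json())
            time.sleep(random.uniform(min_delay, max_delay))  # blend in, never burst
        return results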
My intent was not to attack you, but to respond to your assumptions about the business model, and your comparisons to geoblocking and LinkedIn's anti-crawling stuff.
There were APIs. I won't go into the business relationships that existed over their use.
We also had problems with scrapers using our data. We contacted lawyers, got advice to plant extra messages in the data to gather evidence, and the lawyers then took care of it; the problems have slowly gone away. The cache now works much better. It's not fun when you spend a lot of time and money accumulating data and someone thinks they should have it for free.
If you put the data on the internet, and it’s accessible at a public address, it’s free. If you don’t want people (or robots) to access it, don’t make it public.
You might be interested in the case of LinkedIn vs HiQ [0], which is setting precedent for protecting scraping of public data.
Based on the fact that you “inserted special messages,” it sounds like the people scraping your site may have been republishing the data. That is a separate issue that in some cases can violate copyright. But in that case, it’s not the scraping of the data that is the problem, so much as it is republishing the data outside the bounds of fair use.
I am of the strong belief that if you make your data publicly available to users, you should expect bots to scrape it too. If your infrastructure is set up in a way that makes traffic from those bots expensive, that's your problem. The solution is not to sue people or send them letters. You can mitigate it with infrastructure changes like aggressive caching, or you can charge everyone for access, not just bots. IMO, it's especially wrong if you allow Google to scrape your data but try to stop every other bot from doing the same.
> You can mitigate it with infrastructure changes like aggressive caching
Rate data has a very limited validity period. Customers get super, super pissed (and assume you're scamming them) if, when they click through, they find that the hotel/flight/whatever you listed at $200 on the previous page is now either $250 or sold out. Customers, and the local authorities, also tend to get lawyers involved if it happens (in their eyes) too frequently without a good explanation.
It's expensive to get that rate data, because unless you have your own inventory, you have to go out to third-party APIs to request the rate for each search-parameter tuple, which has specific check-in/check-out dates. When you're searching larger cities - where you might have thousands of hotels - that can be an insanely large number of API calls to return rates.
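To give a feel for the fan-out, with entirely hypothetical numbers:

    hotels_in_city = 3_000           # a larger city
    rate_sources_per_hotel = 2       # e.g. a direct contract plus a wholesaler
    city_searches_per_second = 50

    upstream_calls = hotels_in_city * rate_sources_per_hotel * city_searches_per_second
    print(f"~{upstream_calls:,} upstream rate requests/sec with no caching")  # ~300,000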
Most places (including my former employer) don't have an issue with scrapers, so long as they don't abuse the platform to the point that it causes a ton of extra load. When you have someone who spins up huge numbers of connections at once, that's when we have to do something about it.
> you can charge for access for everyone
That's implicit in the purchase process.
It's like if there's a little cafe that provides free water and tables to sit at on their balcony. That works out for them because it attracts customers. Not everyone might buy something, but most do.
Then someone who runs a dog walking business decides to make that a stop on their walk with 20 dogs. Their dogs eat all the treats, run around the balcony, while the walker sits at the table and drinks the water. Meanwhile, customers are annoyed that there's now 20 barking dogs running around and so they leave.
The business is well within its rights to tell the dog walker to leave and not return, without also having to block others who aren't abusing the system.
A fair number of apps use cert pinning; I'm not sure on the percentages. It's easy to circumvent if you have a jailbroken device. I haven't done this in a few years, but there used to be something called SSLKillSwitch for jailbroken iOS, which would hook the relevant validation calls to disable the cert pinning.
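For anyone wondering what the pin actually checks, here's a rough sketch of the idea in stdlib Python (illustrative only; real apps do this inside their TLS stack rather than after the fact):

    import hashlib
    import ssl

    def leaf_cert_sha256(host, port=443):
        # Fetch the server's leaf certificate and return its SHA-256 fingerprint.
        pem = ssl.get_server_certificate((host, port))
        der = ssl.PEM_cert_to_DER_cert(pem)
        return hashlib.sha256(der).hexdigest()

    def connection_allowed(host, pinned_fingerprint):
        # A pinned client only proceeds if the presented cert matches the hash shipped
        # in the app - which is why a MITM proxy's substitute cert gets rejected.
        return leaf_cert_sha256(host) == pinned_fingerprint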
Thanks for the detail. It's just mind-boggling to me that you could have a peak of (for example) 1K requests per second, and then a bot raises that to 10M requests per second.
If nothing else, it seems incredible that bot authors would be so willfully harmful - they must know that kind of behaviour is going to prompt a reaction.
Another reason they do that is because their partners require that they not advertise special pricing to the world at large; they're contractually forbidden from doing so.
We did have an API, but it wasn't public - because search traffic is expensive.
Both, as mentioned, in compute resources, but also in pissing off and/or outright crashing vendors.
Logging in - requiring an email validation loop - would be simple for bots, but again, anti-bot tech stops them from doing it en masse.