This is one of the more interesting policy questions on the web. Our search engine crawls a lot of blogs and whatnot on the web, and criminals who want to find unpatched WordPress sites try to scrape our crawl by sending automated (scripted) queries to find them. We have developed a number of defenses over the years and pretty regularly ban them[1]. Here is the weird part though: if they hired 300 people on Mechanical Turk and paid them each a dollar to do one search, it probably would get them more information. Look at folks like 80legs or other 'distributed' scrapers. They exist almost solely to subvert these service terms. Are they evil? Creative?
One of the things that stands out is called out in this article. The people involved really want this information, so much that they are willing to expend time and effort to construct scraping bots and what have you. Why not just buy it? How is it that someone gets a request from their boss to get some information, but their boss expects them to get it for free? Can you imagine if they said, "We need pens, pencils, notebooks, staplers, the works for the office here. Oh and you can't spend any money getting that stuff, just get it here." Would they construct some elaborate raid on a nearby office supply store using a mercenary army of criminals? Why do that with information?
We did an experiment where we would 'grep the web' for you, basically run a regex over a multi-billion page crawl, give you the first 50 results for "free" and you could buy the complete set. I think we sold exactly one of those.
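To make the 'grep the web' offer concrete, here is a minimal sketch, not our actual implementation. It assumes the crawl is available as (url, text) pairs; the names grep_crawl, split_preview and FREE_PREVIEW_LIMIT, and the example URLs, are all made up for illustration.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# The "first 50 results for free" from the comment above.
FREE_PREVIEW_LIMIT = 50


def grep_crawl(pattern: str, pages: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield the URL of every page whose text matches the regex."""
    regex = re.compile(pattern)
    for url, text in pages:
        if regex.search(text):
            yield url


def split_preview(
    matches: Iterable[str], limit: int = FREE_PREVIEW_LIMIT
) -> Tuple[List[str], List[str]]:
    """Split matches into (free preview, remainder that would be sold)."""
    all_matches = list(matches)
    return all_matches[:limit], all_matches[limit:]


# Toy stand-in for a crawl: every even-numbered page mentions wordpress.
crawl = [
    (f"http://example.com/{i}", "powered by wordpress" if i % 2 == 0 else "static site")
    for i in range(200)
]
free, paid = split_preview(grep_crawl(r"wordpress", crawl))
print(len(free), "free results,", len(paid), "behind the paywall")
```

A real crawl would not fit in a list, so the paid remainder would presumably be streamed to a file rather than held in memory.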
It is a weird thing, the OP captured it perfectly.
I think there is a great meta-question in here, about business models for digital data and software.
Here you have a great case study, about an organization that tried to do a volunteer model, and it didn't work. Then they pivoted to a commercial model, but fundamentally they still believe in a free tier. But they have to cripple that free tier pretty thoroughly, and even still people abuse it.
I have a product I'm working on, that some people are apparently willing to spend lots of money on. Ideally I would have some kind of low tier, so that people without lots of money would be able to use it too. But I can't figure out a way to segment the product so that everybody pays what they can afford without bad apples abusing the low tier and ruining it for everybody. The result is that I may end up only selling it to customers with deep pockets, even though the product is much more broadly applicable.
They made the process of paying for the software a laborious pain in the ass. They are desperately trying to extract money from those who can pay, which sadly drives away those who can pay but don't want an involved process.
I have worked at lots of companies where I had a monthly budget of 10k+ that I could spend on whatever I wanted, but if I wanted any sort of complex deal (can't just put on CC with a line item) -- had to bring in legal and other groups -- instantly killed any interest.
"Licensing is based on the data needed (e.g. all of it vs subset), how it is used (e.g. internal only, external, product integration), etc."
What a goddamn horror show. I simply want a product, I want to pay for it, and I want to use it. Turning on Dropbox for Business was a decision made in about 5 minutes... "You all like it, already using it, awesome! I will get team setup." -- 5 minutes later I had given Dropbox $3800.
I really think they are getting in their own way for no benefit. They have created a very high barrier to EVEN HAVING A DISCUSSION about buying the product. So, if I don't know exactly how I will use it -- I can't purchase it. Stupidity.
Publishes data in public. Can't get people to pay for it. Blames people for theft. The real thieves are the ones separating people from their wallets over data that is publicly available and censoring it to those who won't pay.
Perhaps it won't come as a surprise but I've been toying with this question for a couple of decades now. Specifically, what are the economics of information? In the 'goods' economy there are some interesting mechanisms that inform the question of value. These include, but are not limited to, the cost to acquire materials, develop expertise in manufacturing, and manage the supply lines from raw material to finished good. Accountants will talk about the "Cost of goods sold" as a grouping function for these costs. In the 'information' economy the manufacturing part is pretty trivial, you just replicate copies, but the assembling part can be quite difficult. This leads to an interesting inversion where it can cost a lot to assemble something and nothing to 'manufacture' it. And that doesn't even begin to touch on what it is about information that makes it valuable in the first place.
What is the difference in value between a CD with the latest release of Ubuntu burned on it, and the download? Between the download and a bootable flash drive?
There is a great experiment you can run which goes like this: at one end of an athletic field, place a chess board with a queen on one of its squares. At the other end of the field have a table where people can get a quest. Offer to pay a person $5 if they will walk to the end of the field, note where the queen is, and come back and tell the quest giver. At the midpoint of the field set up an information seller. They offer to sell you the location of the queen for anywhere between 20% and 80% of the reward price.
This simple experiment lets you see all sorts of mechanisms in play that control information value. You can see the range of values people apply to their own time (acquisition cost) and their willingness to retain value (do they then go back mid-queue at the sign-up table and start offering to sell the information for some fraction of the price to anyone?). At what threshold do people start trying to break the rules (a notion that is similar to price inelasticity but has a component like 'black market demand')?
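The acquisition-cost side of that experiment can be put into rough numbers. This is only a sketch under assumptions that are not part of the experiment as described: that buying at the midpoint saves a fixed number of minutes of walking, and that a person values their time at a flat hourly rate. The function names and the 5 minutes saved are hypothetical.

```python
def break_even_hourly_value(reward: float, price_fraction: float, minutes_saved: float) -> float:
    """Hourly value of time at which buying and walking pay off equally.

    The seller charges price_fraction * reward; buying saves minutes_saved
    of walking. Both minutes_saved and a flat hourly rate are assumptions.
    """
    price = reward * price_fraction
    return price / (minutes_saved / 60.0)


def should_buy(hourly_value: float, reward: float, price_fraction: float, minutes_saved: float) -> bool:
    """True if the time saved is worth more to this person than the price."""
    return hourly_value > break_even_hourly_value(reward, price_fraction, minutes_saved)


# The $5 reward and the 20%..80% price range come from the experiment;
# the 5 minutes saved is an assumed figure.
for fraction in (0.2, 0.5, 0.8):
    threshold = break_even_hourly_value(5.0, fraction, 5.0)
    print(f"seller asks {fraction:.0%} of reward: buy if your time is worth more than ${threshold:.2f}/hour")
```

What this leaves out is the interesting part: resale and rule-breaking do not show up in a per-person break-even at all.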
> We did an experiment where we would 'grep the web' for you, basically run a regex over a multi-billion page crawl, give you the first 50 results for "free" and you could buy the complete set. I think we sold exactly one of those.
Sounds like you didn't advertise it right. I've been looking for a "grep the web" service for a while.
For that $295 I will get a list of all domains using a rival technology... a list of sales leads.
I'm still short of what I want to have... a prioritised list of sales leads.
So I will write my own scraper to go through those results and scrape every one of those so that I can pull some info from the HTML page to tell me how large those customers might be. I'm not aiming for the largest (easy to find, costly to win), nor the smallest (time consuming and pointless to win), but the median.
I would definitely pay for a "grep the web" that allowed me to match pages by text signatures in the HTML, and then extract part of the DOM as values, and return the list of "url + extracted values" for the matching hits.
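Roughly what I have in mind, as a sketch using only the Python standard library. The signature here is a plain substring, the "part of the DOM" is just the page title, and the function names, URLs and rivalwidget.js are invented for the example; a real service would need proper selectors and a real crawl behind it.

```python
import csv
import io
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def grep_and_extract(pages, signature):
    """Return (url, title) rows for pages whose raw HTML contains the signature."""
    rows = []
    for url, html in pages:
        if signature in html:
            parser = TitleExtractor()
            parser.feed(html)
            rows.append((url, parser.title.strip()))
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    writer.writerows(rows)
    return out.getvalue()


# Two made-up pages; only the first embeds the hypothetical rival script.
pages = [
    ("http://a.example/", '<html><head><title>Shop A</title><script src="/rivalwidget.js"></script></head></html>'),
    ("http://b.example/", "<html><head><title>Blog B</title></head></html>"),
]
rows = grep_and_extract(pages, "rivalwidget.js")
print(to_csv(rows))
```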
I'd consider that to be worth similar amounts to what BuiltWith are charging, but I'd add more and would go up to $500, assuming that the results come with the extracted values as a CSV file of some kind and the quality and completeness of the report is high.
People will pay for "grep the web", especially if you sell it to them as something they know they really want: "sales leads".
Crawling others' websites then selling ads. <--- Blekko
Scraping others' websites then selling ads. <--- ????
Selling access to user-generated content. <--- OSVDB
Amusing to watch these folks argue about ethics.
Who owns the copyrights in this data? Surely not the one who is demanding that you pay a license fee. These "services" are middlemen, plain and simple.
This might be why McAfee was wondering about how much manual curation is done.
Maybe that is the only possible theory of how OSVDB could assert any rights in the data (and only in a select few jurisdictions).
If they're a middleman, and it's such an easy "service", why didn't McAfee just bypass them and get the data from the original sources, rather than do the wrong thing?
Since you work at Blekko, there are more points to discuss that were not addressed in the article.
For example:
i) Are search engines web scrapers?
ii) Should search engines pay the scraped sites if they are charging to access their indexed data? Probably some of the scraped sites have a specific license forbidding the search engine to sell their information in any way.
iii) Regarding Internet policies, is it fair/unfair that a site has a robots.txt configuration to avoid being indexed by a search engine other than Google? I would call this "search neutrality".
i) no, search engines articulate the web. They are essentially 'pre-paid' computation.
Take the classic example, a search for 'bilbo baggins'. What that is, is a request to identify documents on the web that have referred to Bilbo Baggins and return their locations.
1) It is absolutely true that you could sit down at your computer and look at each site, from aol to zillow, read all their pages, and note the ones that mention Bilbo. Then you could go back and order the list by the ones that had more of the information you were looking for to the ones with less useful information. Along the way you would find some sites that would not open up to you unless you had an account, those you could not visit.
2) A search engine can look at all the sites, it can note which sites mention Bilbo along with a bunch of other terms and can essentially "pre-compute" that list you were looking for. Along the way it will find sites that, through their robots.txt file, will say "We'd rather you not look here." and it will respect that, thus not indexing those sites.
In both cases figuring out which web pages have information about Bilbo on them is creating 'new' information out of existing data. You can do it on your own and it will cost you time, or you can do it with a few thousand machines and it will cost you money. Either way you get a list of possible sites.
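The "pre-paid computation" in case 2 is essentially an inverted index. Here is a toy sketch of the idea, nothing like a production index: pages are a dict of url to text, excluded sites are a plain set standing in for robots.txt, and the names build_index and search and all the URLs are made up.

```python
from collections import defaultdict


def build_index(pages, disallowed=frozenset()):
    """Map each lowercase term to the set of URLs whose text contains it.

    URLs in `disallowed` (sites that asked not to be looked at) are skipped.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        if url in disallowed:
            continue
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


pages = {
    "http://a.example/": "Bilbo Baggins lived in the Shire",
    "http://b.example/": "Frodo Baggins left the Shire",
    "http://c.example/": "Bilbo Baggins private notes",
}
# c.example asked to be excluded, so it never enters the index.
index = build_index(pages, disallowed={"http://c.example/"})
print(search(index, "bilbo baggins"))
```

The expensive loop runs once at index time; each later query is only a few set intersections, which is why the marginal cost per search is small.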
That list forms a distribution, where there are a lot of sites that don't care one way or the other if you read them, sites that won't show up because they asked to be excluded, and sites that will show up because they paid to be included. Some sites really want you to find them, some sites really don't. A good search engine caters to both types.
ii) If a search engine was taking the page, copying it, and then showing that instead of showing the page (this is what got Google's news product in trouble) then it's pretty clear that they should not do that. But in terms of location information? The sites themselves derive a huge economic benefit from being in the index that isn't reflected at all back to the search engine that sent traffic there [1], so on a pure economic basis the search engine is on the losing side of that transaction. However, the marginal cost of additional transactions is small (search engines are general purpose) so they make a small amount on large volumes.
To put your question in more specifics, where is the economic value in the list; bobs middle earth atlas, wikipedia entry on bilbo baggins, middle earth web ring, imdb pages on characters in the "Lord of the Rings" movie.
Is it that Bob has an atlas of Middle Earth? Or is it the list itself? Who made the list? Bob or the search engine? (or some human curator of a bookmarks page[2])
iii) It is completely up to the site to allow or disallow access to its content by search engines. Some sites do only allow themselves to be indexed by Google and they find they get less search engine directed traffic that way. Some sites don't allow anyone to index them and they get no traffic (sometimes they are surprised by this, sometimes they don't care, sometimes they are angry that the only way for people to find them is to be in a search engine index)
[1] Google broke that by creating AdSense for Content and created a pretty interesting conflict of interest for themselves.
[2] Good luck finding a bookmarks page these days :-)
> ii) Should search engines pay the scraped sites if they are charging to access their indexed data? Probably some of the scraped sites have a specific license forbidding the search engine to sell their information in any way.
Any reputable search engine will respect robots.txt.
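For what "respecting robots.txt" means in practice, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content is a made-up example of the Google-only configuration asked about in (iii), and "ExampleBot" is a hypothetical crawler name.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that admits Googlebot and turns everyone else away.
GOOGLE_ONLY_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Googlebot", "ExampleBot"):
    allowed = may_crawl(GOOGLE_ONLY_ROBOTS_TXT, agent, "http://example.com/page")
    print(agent, "allowed" if allowed else "disallowed")
```

A polite crawler runs a check like this before every fetch and simply skips anything that comes back disallowed. Nothing enforces it, which is the whole point about reputable versus not.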
I'm assuming that the vast majority of the time, it's a case of an individual or small group that wants the information, not a real corporation. Usually they have no intention of paying for anything or playing by the rules.
I have absolutely no explanation regarding McAfee though, considering the billions McAfee and its parent company, Intel, make in revenue yearly.
> Look at folks like 80legs or other 'distributed' scrapers. They exist almost solely to subvert these service terms.
Not sure I understand. Is there a whole ecosystem of web businesses that feed off your search engine for free, or did you mean off several legit search engines, or ... ?
This is exactly the argument MPAA makes against people torrenting movies and tv shows, using words such as "criminals", "evil", in an attempt to polarize people who are simply consuming information over the internet in the most efficient way possible, without any 3rd party oversight and censorship, the way information on the internet is always bound to be, the way equilibrium is achieved.
The argument that if people really want this information, they should pay for it doesn't work. Basically, you are punishing people for automating manual labor. The fact that you can hire a bunch of people and tell them to manually copy data from the website and achieve the same result means that it shouldn't be any different from automating the process itself.
Terms of service on a website are not the law. If there is something you don't want people to have access to, then don't publish it online at all.
[1] It's a violation of the terms of service.