Hacker News

If I were in OSVDB's shoes, I would call these people out in an email and ask them to pay a licensing fee. McAfee have always had a shady past; even after they shook themselves clean of John, they have a history of scam-like behaviour to make a quick buck.

You should be rate-limiting how many requests free API users can make, as Twitter, Facebook, and every other Internet provider does with their APIs. Make it as hard as you can for people to obtain the information in bulk; paying becomes more attractive when scraping is impractical and would take a lot of time.
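The per-user throttling suggested above is commonly implemented as a token bucket. A minimal sketch (the class name and parameters here are illustrative, not any particular API's):

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens/second."""

    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # Caller would typically respond with HTTP 429 Too Many Requests

# Example: allow a burst of 5 requests, then one request every two seconds.
bucket = TokenBucket(capacity=5, rate=0.5)
```

One bucket per API key (or per IP) is enough to stop naive bulk downloads, though, as noted below, it won't stop a determined scraper with many accounts.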

Think of your offering as a car. You currently have no car alarm or immobiliser; if you install both, you make it very hard for a thief to steal your car.



On the other hand, large (even non-profit) organisations are precisely the ones with the resources to scrape stealthily and widely, as would a loosely-organised community of users... it's not hard to come up with algorithms that respect the rate limits, balance the load across multiple IPs and accounts, and produce access patterns that don't look any different from the rest of the site traffic.


Large organizations also tend to have risk-averse lawyers. I'm not a fan of the CFAA, but if what Aaron Swartz did was illegal, then so is this.


Like much of security, it's not about making it impossible; it's about making it a lot less convenient and a bit harder.

At some point the effort to circumvent costs more in man-hours than just buying the product.


You can get scraping libraries fairly easily. In my more shady past I developed and shared one: an HTTP client library with automatic proxy rotation that stayed friendly with rate limits.

When I used it (almost a decade ago) I never ran into problems: plug in a list of 10,000 proxies and scrape away.
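The rotation trick described above is little more than round-robin proxy selection with a per-proxy delay. A minimal sketch, with made-up addresses, of what such a library does internally:

```python
import itertools
import time

class ProxyRotator:
    """Cycle through a proxy list so no single proxy is reused too quickly."""

    def __init__(self, proxies, min_interval=1.0):
        self.cycle = itertools.cycle(proxies)
        self.min_interval = min_interval  # seconds between uses of the same proxy
        self.last_used = {}               # proxy -> timestamp of last use

    def next_proxy(self):
        proxy = next(self.cycle)
        wait = self.min_interval - (time.monotonic() - self.last_used.get(proxy, 0.0))
        if wait > 0:
            time.sleep(wait)              # stay under the per-proxy rate limit
        self.last_used[proxy] = time.monotonic()
        return proxy

# Hypothetical proxy addresses, for illustration only.
rotator = ProxyRotator(["10.0.0.1:8080", "10.0.0.2:8080"], min_interval=2.0)
```

With N proxies, the effective request rate is N times the per-proxy limit, which is why rate limiting alone doesn't stop this.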

Not condoning that, which is a bit hypocritical of me; at the time I was mostly doing what I was told, and I thought I was clever. Now that I'm in a position to have a positive impact, I do buy data and pay the appropriate licence fees on all software/data purchases, which still baffles some of my programmers, who constantly ask "why not crack it?", "you know I found a .zip on Google with the data, why buy it?", and so forth.

I don't know what it is in programmer culture that makes it so hard for us to pay for something. People put real effort behind that software / data collection, and it's only fair to pay them.


I know that paying for things was annoying when I didn't have money. Now that I have cash, I'm willing to pay (reasonable amounts of) money for digital things.

I still hesitate when it comes to thousand-dollar licenses for my personal use, though.


Speaking as devil's advocate, it might just be more convenient to steal the data.

Maybe I want to use your data casually once, and I don't want to sign up and give you all my contact details and subscribe to your annual plan with all the other optional extras.

Tough shit, you say? I'll just steal it then, and not because I can't afford it, but because you're making it hard to pay.


but how is it stealing when one does not lose inventory? If one person scrapes a page, did you lose the source code? Does it not become available for the next visitor? What possible loss do you incur that is directly tied to your data? When you make data public with the intent of being readily accessible by the public, how can you claim theft when you are achieving what you set out to do? Does the accelerated rate of access suddenly become a theft? Does one need to pay a third party simply to avoid the manual labor of hosting data which is available to the public? Help me understand.


"but how is it stealing when one does not lose inventory?"

Scraping is not necessarily a no-victim situation. Even today after this stuff has gotten cheaper, you're costing them bandwidth fees, and likely increasing their server storage and CPU fees if it's on a metered hosting service, which is quite likely nowadays. If you degrade their site's functionality, you may chase away paying customers.

We need not hypothesize crazy third-order effects; you are taking money out of their pockets by the act of scraping itself, independent of the question of the value of the content.

"What about Google? etc." - robots.txt-honoring scrapers that don't hammer the sites at least have a plausible claim to permission. Scrapers are quite likely to be ignoring the robots.txt.


> Scraping is not necessarily a no-victim situation. Even today after this stuff has gotten cheaper, you're costing them bandwidth fees, and likely increasing their server storage and CPU fees if it's on a metered hosting service, which is quite likely nowadays. If you degrade their site's functionality, you may chase away paying customers.

While technically correct, you are conflating the issues, because in none of the cases mentioned so far in this thread is the problem the bandwidth/storage/CPU cost of retrieval to any significant extent.

Instead, it appears that almost all of the costs are incurred before retrieval: curating, sorting, etc.

I'm not arguing that it's okay, but it's no more stealing / thievery than downloading movies or music is.


Even if you limit the rate, they will just create multiple accounts on multiple servers. There is no way you will make them pay for it, as you would find it hard to prove it's them doing the scraping in the first place.


If I were in their shoes I would track their IPs and send them bogus data along the lines of "Please pay for a commercial license."


You need to be careful sending bogus data: in some jurisdictions this could be argued to be deliberate targeted commercial sabotage. You would no doubt eventually win any resulting legal argument, assuming you could afford to carry the argument on to that conclusion. Sending no data, or limited data, would be safe though.

A better method would be to set "default" pricing (something high but not ridiculous, that could easily be negotiated downwards if they contact you) and make access beyond a few requests a click-through (or better: have them respond to an email before progressing further) where they agree to that pricing if they are using the information commercially.


You don't have to return bogus data; you can return an HTTP error code: 402 Payment Required.
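With Python's standard `http.server`, refusing unpaid requests with a 402 takes only a few lines. A sketch, where the API-key check and the message text are invented stand-ins:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class PaywallHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Stand-in check: real code would validate API keys or count requests.
        if not self.headers.get("X-Api-Key"):
            self.send_response(402)  # 402 Payment Required
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Payment required for bulk access; contact us for a commercial license.\n")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"...the actual data...\n")

# To serve: HTTPServer(("", 8000), PaywallHandler).serve_forever()
```

Unlike bogus data, a 402 is unambiguous: the scraper knows exactly why they were refused and gets nothing they could mistake for real content.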


Exactly.

The problem I see is with giving out bad data while leading people to believe they have obtained useful information (that they then embarrass themselves by using/re-distributing).

I did misread the grandparent post though: he was suggesting the bogus data was the message, and I read it as handing out fake data to the scraper along with a message to be seen should a human be looking.


Yeah I can see why fake data could give you a harder time in terms of lawsuits. Similar to people who fight hotlinking of their content by replacing images with something offensive.


>"You need to be careful sending bogus data: in some jurisdictions this could be argued to be deliberate targeted commercial sabotage." //

That sounds pretty ludicrous; do you have anything to back it up - case law, a settlement report? It would be analogous to serving a fake image to combat hotlinking, or a fake page to combat framing.


Something that is obviously fake or otherwise different (like the image or frame break-out examples) would also be fine.

But leading someone to believe they have correct data, when using it could embarrass them, could be something they'd take objection to. Even if not, there are two other points of risk: your reputation if something goes wrong and you accidentally give bad data to your paying clients, and your reputation if someone, paying or otherwise, shows off the bad data as "the sort of crap these people try to sell".

I don't have any specific references, but it is something I would be careful of as there have certainly been similarly ludicrous (IMO) cases on unrelated matters in the past (yes the right side would win, assuming they can afford to).

I may be being too cynical here, then again maybe not...

I did misread the grandparent post though, and this isn't what he was talking about. He was suggesting the bogus data was the message, and I read it as handing out fake data to the scraper along with a message to be seen should a human be looking.


Hasn't the mapping industry already set a precedent in this field? They provide maps with small, wrong roads/trails specifically to identify who is stealing their data.


Pretty much, but precedents set in the "real world" don't always carry over to "on the Internet", even when the correlation is stark, obvious, and indisputable to most people's eyes.



