
Is there a way to reliably block Google and AI crawlers?



If you use Cloudflare to proxy your site, there is a button to click that blocks the AI crawlers (even on the free tier). It is almost as if the AI crawlers are a DDoS attack. You can't really do it any other way, since many don't respect robots.txt. At least until someone comes up with crowdsourced blacklists with few false positives.
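
For reference, the do-it-yourself alternative to that button is usually a user-agent check, roughly like the sketch below. The token list is an illustrative, incomplete assumption and not an authoritative registry, and `is_declared_ai_crawler` is a made-up helper name. It only catches crawlers that identify themselves honestly, which is exactly the limitation described above.

```python
# Illustrative, incomplete list of user-agent tokens that some AI crawlers
# announce. This is an assumption for the sketch, not an authoritative list.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot")


def is_declared_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a known crawler token.

    Case-insensitive substring match. A crawler that spoofs a browser
    user agent passes straight through this check.
    """
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)
```

A server or middleware would call this on each request's User-Agent header and return a 403 on a match.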

"You can't really do it any other way"

Any custom solution by a half-competent programmer filters out all web crawlers. I've been running a semi-public website for years and nothing gets past


Yeah, I feel like unless you run a site large enough for google monkeys to write a special case for your site specifically, why not just password protect the entire site but put the password on the login page? Or any other rudimentary captcha I suppose - like the old days.

Doesn't keep out anyone even mildly interested in your site specifically, including scrapers, but at least it blocks googlebot etc.
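
A minimal sketch of that "password printed on the login page" gate, written as a framework-agnostic function. Everything here is hypothetical: the password value, the cookie name, and the `handle` function are made up for illustration, and a real site would want a signed session cookie rather than a fixed string.

```python
from urllib.parse import parse_qs

PASSWORD = "open-sesame"  # hypothetical; deliberately shown on the login page
COOKIE = "gate=ok"        # hypothetical marker cookie, not a secure session

LOGIN_PAGE = (
    "<p>The password is: " + PASSWORD + "</p>"
    '<form method="post"><input name="pw"><button>Enter</button></form>'
)


def handle(method: str, cookie_header: str, body: str = ""):
    """Return (status, headers, html) for a single request.

    A generic crawler that only issues GET requests and keeps no cookies
    never gets past the 401 login page.
    """
    if COOKIE in cookie_header:
        return "200 OK", [], "<p>Actual site content.</p>"
    if method == "POST" and parse_qs(body).get("pw", [""])[0] == PASSWORD:
        headers = [("Location", "/"), ("Set-Cookie", COOKIE + "; Path=/")]
        return "303 See Other", headers, ""
    return "401 Unauthorized", [], LOGIN_PAGE
```

As the comment says, anything that bothers to read the page and submit the form gets in, so this only filters generic crawlers.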


Funny edge case when you can’t read the password because you need it for access

You have heuristics, blacklists and captchas. Anything else to add? Those three can all turn away legitimate traffic from public sites. Spambots have been pretending to be legitimate users for decades, and they tend to be pretty dumb. Cloudflare and other large hosts get to do heuristics pretty well, as they can aggregate data from millions of sites rather than the few an individual might run. And even they block and force captchas on legitimate users, per the complaints you hear here regularly.

We have adblockers, which rely on open-source lists of rules. Could we apply something similar to crawlers? Website owners would provide a list of IP addresses that accessed them, determine which ones are likely robots, and then update a shared list of addresses to block. If everyone works together, you could probably fingerprint the crawlers as well and block based on the fingerprint. It might increase the cost of crawling a little, but it won't be fully reliable.
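
One way the aggregation step could look, as a rough sketch: each site submits the IPs it suspects, and an address is only listed once several independent sites agree, which is a crude guard against false positives. The `build_blocklist` name, the report format, and the `min_sites` threshold are all assumptions for illustration; the addresses come from the reserved documentation ranges.

```python
from collections import Counter


def build_blocklist(reports: dict[str, set[str]], min_sites: int = 2) -> set[str]:
    """Combine per-site reports into a shared blocklist.

    `reports` maps a site name to the set of IPs that site flagged as
    likely robots. An IP is listed only if at least `min_sites` distinct
    sites reported it. Each site counts once per IP because the
    per-site reports are sets.
    """
    counts = Counter(ip for ips in reports.values() for ip in ips)
    return {ip for ip, n in counts.items() if n >= min_sites}


# Hypothetical reports from three participating sites.
reports = {
    "site-a": {"203.0.113.5", "198.51.100.7"},
    "site-b": {"203.0.113.5"},
    "site-c": {"203.0.113.5", "192.0.2.9"},
}
shared = build_blocklist(reports, min_sites=2)
```

Here only 203.0.113.5 is reported by more than one site, so it is the only address that ends up in `shared`. Crawlers that rotate through large IP pools would still slip through, which matches the "won't be fully reliable" caveat.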


