Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> The weird thing about this is that the only company I've seen doing problematic user-agent handling in recent years is Google themselves.

I frequently consume web articles with a combination of newsboat + Lynx, and it's astounding how many websites throw up HTTP 403 messages when I try to open a link. They're obviously sniffing my user agent because if I blank out the string (more accurately, just the 'libwww-FM' part, then the site will show me the correct page.

I'm pretty sure that the webmasters responsible for this are using user agent string blocking as a naive attempt to block bots from scraping their site, but that assumes that the bots that they want to block actually send an accurate user agent string the first place.



> I'm pretty sure that the webmasters responsible for this are using user agent string blocking as a naive attempt to block bots from scraping their site, but that assumes that the bots that they want to block actually send an accurate user agent string the first place.

That is exactly what they are doing, and it works really well.

We blocked user agents with lib in them at reddit for a long time.

Any legit person building a legit bot would know to fake the agent string.

The script kiddies would just go away. It drastically reduced bot traffic when we did that. Obviously some of the malicious bot writers know to fake their agent string too, and we had other mitigations for that.

But sometimes the simplest solutions solve the majority of issues.


> Any legit person building a legit bot would know to fake the agent string.

What, that's totally backwards. Anyone using a bot to do things that might get blocked by publishers fakes the string, legit purposes should really show who / what they are.


It actually is encouraging people to have useful user agents. By default most people end up with a user agent that's something like "libcurl version foo.bar.baz", which isn't actually a description of who or what they are; given the prevalence of curl, it really just tells you that it's a program that uses http.


We only blocked agent strings with "lib" in them. You could change the agent to "WebScraperSupreme.com" and it would have been fine (and in fact some people did do that).


Yes perhaps. But it caused problems for regular users like this fellow. I also have tried various 'download via script' for web pages for offline use. I thought I had a problem on my end, I never realized I could have been getting blocked.


Hard to argue with the economics of that mitigation though. The abuse:legitimate use ratio is probably pretty high. Getting rid of user agent strings will bring back the scaling problems, as they should probably be addressed directly.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: