zabjh's comments

I'm chiming in here since my employer has a few web archives from the IA and some other organizations.

That 10x average seems to be a bit off considering our data, which is of course spotty since it's crawled by a third party.

But to give some numbers: in one of our experiments we filtered web pages from the archive for known entities and got 307,426,990 unique URLs that contained at least two of those entities (625,830,566 non-unique), spread across only 5,331,272 unique hosts. That archive contains roughly 3 billion crawled files (not only HTML, but also other MIME types) and covers mostly the German web over a few years.

There are a lot of hosts that have millions of pages. To name a few: Amazon, WordPress, eBay, all kinds of forums, even banks. For instance, www.postbank.de has over a million pages and they were not re-crawled nearly that often.
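
For anyone who wants the ratios behind those counts, here is a quick back-of-the-envelope sketch. It only uses the three figures quoted above; the variable names are mine, and the result describes the entity-filtered subset, not the whole archive.

```python
# Counts quoted above (entity-filtered subset of the archive).
unique_urls = 307_426_990    # unique URLs containing at least two known entities
total_urls = 625_830_566     # the same URLs, counting repeated captures
unique_hosts = 5_331_272     # distinct hosts those URLs belong to

# Mean number of unique URLs per host in the filtered subset.
urls_per_host = unique_urls / unique_hosts

# Mean number of captures per unique URL (how often a URL shows up again).
captures_per_url = total_urls / unique_urls

print(f"{urls_per_host:.1f} unique URLs per host")        # 57.7
print(f"{captures_per_url:.2f} captures per unique URL")  # 2.04
```

Keep in mind these are means. With a handful of hosts holding millions of pages each, the distribution is heavily skewed and the typical host will sit well below the mean.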


OT, but why does postbank.de have over a million pages? Is there much (or anything) worth crawling there?


We never checked the details there. You can use https://web.archive.org/details/www.postbank.de and https://web.archive.org/web/*/postbank.de/* if you want to go exploring.

I assume a lot of them will be broken links and automatically generated pages, as is often the case.
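
If you would rather script the exploration than click through the wildcard view, the Wayback Machine also exposes a CDX query endpoint. The following is a minimal sketch that only builds the query URL and does not fetch anything; the helper name `cdx_query_url` and the choice of fields and limit are my own assumptions, not something we used in our experiments.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str, limit: int = 50) -> str:
    """Build a CDX query URL listing captured URLs for a domain.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    params = {
        "url": domain,
        "matchType": "domain",    # include subdomains such as www.
        "collapse": "urlkey",     # one row per distinct URL
        "fl": "original,timestamp,statuscode",
        "output": "json",
        "limit": limit,           # keep the response small
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(cdx_query_url("postbank.de"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return one row per distinct URL, which makes it easier to see how many of the pages are error responses or generated variants.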

