I'm chiming in here since my employer has a few web archives from the IA and some other organizations.
That 10x average seems to be a bit off considering our data, which is of course spotty since it's crawled by a third party.
But to give some numbers: in one of our experiments we filtered web sites from the archive for known entities and got 307,426,990 unique URLs that contained at least two of those entities (625,830,566 non-unique), and those came from only 5,331,272 unique hosts. That archive contains roughly 3 billion crawled files (not only HTML, but also other MIME types) and covers mostly the German web over a few years.
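For what it's worth, here is the back-of-the-envelope arithmetic on those figures as a small Python sketch. The variable names are mine, and I'm assuming the "non-unique" count means repeated captures of the same URL. Keep in mind this only covers the filtered subset (URLs with at least two known entities), so it understates the real page count per host.

```python
# Figures quoted above, from the entity-filtered subset of the archive.
unique_urls = 307_426_990    # unique URLs containing at least two known entities
total_urls = 625_830_566     # same URLs, counted non-uniquely (assumed: incl. re-crawls)
unique_hosts = 5_331_272     # unique hosts those URLs belong to

# Average number of matching unique URLs per host.
urls_per_host = unique_urls / unique_hosts

# Average number of times a matching URL shows up (assumed: captures per URL).
captures_per_url = total_urls / unique_urls

print(f"matching unique URLs per host: {urls_per_host:.1f}")   # 57.7
print(f"captures per unique URL: {captures_per_url:.2f}")      # 2.04
```

So even in the filtered subset the mean is well above 10 unique URLs per host, and it's only a mean: it says nothing about how skewed the distribution is.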
There are a lot of hosts that have millions of pages. To name a few: Amazon, WordPress, eBay, all kinds of forums, even banks. For instance, www.postbank.de has over a million pages, and they were not re-crawled nearly that often.