"Towards Web-scale Web Archaeology"
Shun-Tak A. Leung, Sharon E. Perl, Raymie Stata and Janet L. Wiener
Report #174, September 10, 2001. 

Web-scale Web research is difficult. Information on the Web is vast in
quantity, unorganized and uncatalogued, and available only over a
network with varying reliability. Thus, Web data is difficult to
collect, to store, and to manipulate efficiently.

Despite these difficulties, we believe performing Web research at Web scale
is important. We have built a suite of tools that allow us to experiment
on collections that are an order of magnitude or more larger than those
typically cited in the literature. Two key components of our current tool
suite are a fast, extensible Web crawler and a highly tuned, in-memory 
database of connectivity information. A Web page repository that supports 
easy access to and storage for billions of documents would allow us to 
study larger data sets and to study how the Web evolves over time.