with Raymie Stata, Janet Wiener, and others. Compaq Systems Research Center, Summer 2000.
1 Introduction
There is considerable interest in analyzing the manner in which web-pages are linked together. The link-structure of the web-graph is part of the page ranking criteria used by some popular web-search sites. It has also been used in research on the evolution and form of the web. The Connectivity Server (CS) provides fast, random access to this information. In order to accomplish this, it has to compress the graph information to fit in memory. The current version (CS2) can store link information for approximately 200 million pages (about a 3-week web crawl) in 8GB of RAM. During the summer, we assessed different methods for improving the compression in CS2. Implementing several alternatives, we produced CS3, which improves compression by a factor of two, allowing us to double the number of links we are able to store in a given memory size. 2 The Connectivity Server The CS densely allocates positive numeric IDs to all the web pages in the database. Link information is returned as lists of IDs forming an adjacency list for an ID being queried.ID order approximates alphabetical order of the URLs identifying pages, with the exception that pages are first separated into groups corresponding to having `many', `intermediate' or `few' links. This latter division ensures that well-referenced pages have small and proximate IDs (which gives better encoding), and also improves index compression (we are able to make assertions about the number of links of a page). CS2 also provides other indexing and lookup functionality which is not affected by the changes we made. There are two models we can use to reduce the amount of data required to store the web-graph. The first model makes predictions about the set of links on a page based on other pages' links. This gives us inter-row compression. The second model makes predictions about individual links on a page, given other links on the same page. This gives us intra-row compression. 
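To make the intra-row idea concrete, here is a small illustration of one common technique of this kind: within a single page's adjacency list, link IDs tend to cluster (nearby URLs get nearby IDs), so storing the gaps between sorted IDs yields small numbers that take fewer bits to encode than the raw IDs. The function names below are ours, not from CS2/CS3, and this is a sketch of the general idea rather than the exact encoding the report uses.

```python
# Intra-row compression sketch: encode a page's sorted link IDs as the
# first ID followed by successive gaps, which are small and cheap to code.

def gaps(adjacency_row):
    """Encode a list of link IDs as the first ID plus successive gaps."""
    row = sorted(adjacency_row)
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]

def ungaps(encoded):
    """Invert gaps(): recover the original sorted ID list."""
    out = [encoded[0]]
    for g in encoded[1:]:
        out.append(out[-1] + g)
    return out

links = [100045, 100046, 100050, 100200]
enc = gaps(links)  # [100045, 1, 4, 150] -- small gaps instead of large IDs
assert ungaps(enc) == links
```

The gap values themselves would then be fed to a variable-length code (such as the Huffman codes discussed below), which is where the actual space savings come from.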
One of the main contributions of CS3 is the addition of inter-row compression to CS2, which already included intra-row compression. CS3 also improves the overall compression by using escaped Huffman codes to encode the data, in place of a length-and-value code.

3 Realization

We formulated several compression schemes based on different models and evaluated them by predicting the compression achieved on some of our datasets. We chose the scheme sketched below because it seemed the most flexible (the speed vs. compression tradeoff is easy to adjust) and the most efficient (fast). For each set of links on a page to be compressed:
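The individual steps of the scheme are elided in this copy of the report. The following is a minimal sketch of a reference-based inter-row scheme of the general kind described, assuming a window of `W` previously encoded rows: for each new row, pick the most similar row in the window and store only a reference to it plus the differences. All names here are ours; this is a reconstruction of the idea, not the exact CS3 algorithm.

```python
# Inter-row compression sketch (hypothetical, not the exact CS3 scheme).
# Each "row" is the set of link IDs on one page.

W = 8  # window size; the report experimented with values in the range 2-32

def encode_row(row, window):
    """Encode `row` against a window of previously encoded rows.

    Returns (ref_index, deletions, additions): ref_index is the offset
    into the window of the chosen reference row, or -1 if no reference
    helps; deletions/additions transform the reference row into `row`.
    """
    best = (-1, set(), set(row))        # default: encode row outright
    best_cost = len(row)
    for i, ref in enumerate(window):
        dels = ref - row                # links in the reference but not here
        adds = row - ref                # links here but not in the reference
        cost = len(dels) + len(adds)    # crude proxy for encoded size
        if cost < best_cost:
            best, best_cost = (i, dels, adds), cost
    return best

def decode_row(encoded, window):
    """Invert encode_row() given the same window of earlier rows."""
    ref_index, dels, adds = encoded
    base = window[ref_index] if ref_index >= 0 else set()
    return (base - dels) | adds

# Encode a stream of rows, each against the last W already-seen rows.
rows = [{1, 2, 3, 9}, {1, 2, 3, 4, 9}, {50, 51}]
seen = []
for r in rows:
    enc = encode_row(r, seen[-W:])
    assert decode_row(enc, seen[-W:]) == r
    seen.append(r)
```

Under this framing, the window size and chain length mentioned below map naturally onto how far back a reference may point and how many reference hops the decoder must follow.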
There are several parameters that can easily be changed in the above scheme, including the size of the window to examine and the length of the chains to allow. Increasing the size of the window results in better compression, though with diminishing effect; we experimented with values in the range 2-32. Allowing longer chains improves compression but adversely affects speed. Depending on the database chosen, we obtained 44%-54% compression compared to the original compression scheme in CS2. We also observed that the new compression does not cause any significant slowdown of the program. We are now able to store link information for approximately 400 million pages in 8 GB of RAM.
©2001 Rajiv Wickremesinghe <rajiv@cs.duke.edu> Last modified: Mon Jan 8 13:29:53 2001 by rajiv.