Digital's Web Proxy Traces

This document begins by offering a quick introduction to using web proxy traces that were taken at Digital Equipment Corporation, and then discusses these traces in greater detail.

A Quick Introduction

The key things you need to know to work with these traces are:

Current Traces -- Currently available is one set of traces taken over a period of a few weeks, including approximately 24,477,674 references. These traces are from an instrumented version of the squid proxy, and record all external web requests generated from within Digital Equipment Corporation during that period.
- A list of the currently-available traces. Please read the fine print at the bottom of that page!
- More on the environment traced
Post-processing -- After collection, the trace data was post-processed. The goal of this post-processing was to transform four of the larger, more unwieldy fields (server's name, URL's path, query [if any], and client's IP address) into four simple 32-bit integers. This transformation was done to allow easy manipulation of trace records and to maintain privacy.
- More on post-processing
Trace Files -- Each trace file starts with an 8 KB ASCII text header, with details of the post-processing including total counts for the number of unique clients, servers, URL paths, URL queries, and total number of events. This header is then followed by trace event records stored as binary data. (Note: these traces were processed on an Alpha, a little endian machine.)
- More on trace file headers
Record Fields -- The following items are available in each trace event record:
- the time at which the request was seen
- the duration of the request
- the amount of time the proxy spent communicating with the server
(the three items above are all in microseconds), and
- the object's last-modified date, if available
- a unique integer ID for the client
- a unique integer ID for the server
- the server port number
- a unique integer ID for the path
- a unique integer ID for the query, if any
- a code specifying the object type (based on the path name extension, if any)
- the object size in bytes
- the method used
- the protocol used
All of these fields are encoded as integers of various sizes.
- More on record fields and their values
Reading Traces -- We provide the program proxytrace2txt, as a tool to convert the trace files into human-readable form, and as a template for developing simulation programs that use these traces.
- More on proxytrace2txt

More Details

The rest of this document is organized as follows:

Environment in which traces were taken
Post-processing of the traces
1. Unique ID & fingerprint
2. Trace file header
Trace fields and values
Reading traces

Environment in Which Traces Were Taken

We modified the 1.0.beta17 version of the squid proxy to output specific details of its actions at various points. This instrumented proxy software was then used on two machines that act as web proxies for Digital's internal network. Because the DNS entry for the proxy was set up so that the DNS servers would randomly return either of the two addresses with equal probability, proxy clients split their requests between the two proxies. The trace stream represented in these files is a combination of requests traced at both machines, and thus represents all external web-related requests generated from within Digital.

Post-processing of the Traces

In post-processing the instrumented proxy output, the following steps were taken:

parse -- Individual events from the instrumented proxy (e.g. ###time_stamp### accept TCP connection from client xxx.xxx.xxx.xxx) were correlated to generate an intermediate format with one line for each client request.
clean -- Four fields were converted from strings (or IP addresses) to unique ID numbers (32-bit integers). In this conversion, a keyed MD5 fingerprint of the string is made and stored. This fingerprint can then be used to correlate specific field values across trace sets. The files containing the mapping from unique IDs to 128-bit fingerprints are available as well. This step was also taken to protect the privacy of internal web activity; we will not make available the mappings between the original strings or addresses and the MD5 fingerprints.
- More details are discussed in Unique ID & fingerprint
text to binary conversion -- This last step takes what has been up to now ASCII text data and writes it as binary data, preceded by an 8 KB header containing a summary of the previous steps, as well as cumulative count information and a trace version number.

Unique ID & fingerprint

As explained above, four fields are converted to unique ID numbers. These numbers simply represent an enumeration of the values seen for that field. For example, if the first value seen for the server field is www.Joe_server.com, then this value will have the unique ID number of 1, and every time a request to this server is seen the server number field will have the value 1. Although a trace set may be kept in several files (usually one file per day) these unique ID numbers are consistent across an entire set of traces.
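The enumeration described above can be sketched in a few lines of C. This is only an illustration of the idea, not the code used to post-process the traces: the names IdTable and id_for are hypothetical, and a linear search stands in for whatever lookup structure the real tools used.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration: a growable table that assigns IDs 1, 2, 3, ...
 * in order of first appearance. Linear search keeps the sketch short; a
 * post-processor handling millions of strings would want a hash table. */
typedef struct {
    char   **values;    /* values[i] holds the string with ID i + 1 */
    unsigned count;     /* number of distinct values seen so far */
    unsigned capacity;  /* allocated slots in values */
} IdTable;

/* Return the ID for value, assigning the next sequential ID if it is new.
 * Returns 0 on allocation failure (0 is never a valid ID). */
unsigned id_for(IdTable *t, const char *value)
{
    for (unsigned i = 0; i < t->count; i++)
        if (strcmp(t->values[i], value) == 0)
            return i + 1;

    if (t->count == t->capacity) {
        unsigned new_cap = t->capacity ? t->capacity * 2 : 16;
        char **grown = realloc(t->values, new_cap * sizeof *grown);
        if (grown == NULL)
            return 0;
        t->values = grown;
        t->capacity = new_cap;
    }

    size_t len = strlen(value) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return 0;
    memcpy(copy, value, len);
    t->values[t->count++] = copy;
    return t->count;
}
```

Starting from a zero-initialized IdTable, the first call with www.Joe_server.com returns 1, and every later call with the same string returns 1 again. The count member then plays the same role as the per-field maximum recorded in the trace file header.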

Additionally, when these fields are transformed to unique IDs, a 128-bit fingerprint is used to identify each string (see Trace Fields and Values for more details on the strings representing these fields). Files mapping each fingerprint to each unique ID number are also available. The purpose of these files is to allow correlation between different trace sets without compromising the specifics (or privacy) of the traced site's references.

Trace File Header

The ASCII text header for each file details the output from post-processing. Each such header contains the maximum value that each of the unique IDs has reached in this trace file. This number is basically a count of the total number of values seen for that field in the traced stream ending with this file. For example, if the client field has a maximum of 100, then we can say that by the end of this trace file, the entire stream traced has seen requests from 100 different clients.
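A program that only needs the binary records still has to consume the header first. The following is a minimal sketch, assuming that "8 KB" means exactly 8192 bytes and that the file has already been uncompressed; the names TRACE_HEADER_BYTES and read_trace_header are hypothetical and do not come from proxytrace2txt.

```c
#include <stdio.h>

/* ASSUMPTION: the "8 KB" header is exactly 8192 bytes long. */
#define TRACE_HEADER_BYTES 8192

/* Read the ASCII header into buf, which must hold TRACE_HEADER_BYTES + 1
 * bytes, and NUL-terminate it so it can be printed or parsed as text.
 * On success the stream is left positioned at the first binary record.
 * Returns 0 on success, -1 if the file is shorter than the header. */
int read_trace_header(FILE *fp, char *buf)
{
    size_t got = fread(buf, 1, TRACE_HEADER_BYTES, fp);
    if (got != TRACE_HEADER_BYTES)
        return -1;
    buf[TRACE_HEADER_BYTES] = '\0';
    return 0;
}
```

The stream should be opened in binary mode ("rb") so that the record bytes following the header are not translated on platforms that distinguish text from binary files.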

Trace Fields and Values

The structure of each trace record is shown in the type definition below. The type u_4bytes is an unsigned four byte integer.

    typedef struct  _TEntry {
	u_4bytes			event_duration;
	u_4bytes			server_duration;
	u_4bytes			last_mod;
	u_4bytes			time_sec;
	u_4bytes			time_usec;

	u_4bytes			client;
	u_4bytes			server;
	u_4bytes			port;
	u_4bytes			path;
	u_4bytes			query;
	u_4bytes			size;

	unsigned short			status;
	unsigned  char			type;
	unsigned  char			flags;
 	method_t			method;
	protocol_t			protocol;
    } TEntry;

The duration fields are measured in microseconds. The event_duration field is the time from when the proxy first accepts a connection from the client to when the proxy successfully closes that connection. The server_duration field is the time the proxy is connected to the web server. Note that server_duration takes the unsigned equivalent of -1 (0xFFFFFFFF) when no connection to the server was made.
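Because the records were written on a little-endian machine, a portable reader can decode each field byte by byte instead of reading the struct directly. The sketch below rests on assumptions that are not stated in this document: that fields appear on disk in declaration order with no padding, and that method_t and protocol_t occupy four bytes each, giving a 56-byte record. The names TraceRecord, TRACE_RECORD_BYTES, NO_SERVER_CONNECTION and decode_record are hypothetical; check the layout against proxytrace2txt before relying on it.

```c
#include <stdint.h>

/* Unsigned equivalent of -1: no connection to the server was made. */
#define NO_SERVER_CONNECTION 0xFFFFFFFFu

/* ASSUMPTION: 11 four-byte fields + 2 + 1 + 1 + two four-byte enums,
 * stored in declaration order with no padding. */
#define TRACE_RECORD_BYTES 56

/* Hypothetical fixed-width mirror of TEntry. */
typedef struct {
    uint32_t event_duration, server_duration, last_mod;
    uint32_t time_sec, time_usec;
    uint32_t client, server, port, path, query, size;
    uint16_t status;
    uint8_t  type, flags;
    uint32_t method, protocol;
} TraceRecord;

/* Assemble a little-endian 32-bit value regardless of host byte order. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Assemble a little-endian 16-bit value regardless of host byte order. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode one record from TRACE_RECORD_BYTES raw bytes. */
void decode_record(const unsigned char *b, TraceRecord *r)
{
    r->event_duration  = le32(b + 0);
    r->server_duration = le32(b + 4);
    r->last_mod        = le32(b + 8);
    r->time_sec        = le32(b + 12);
    r->time_usec       = le32(b + 16);
    r->client          = le32(b + 20);
    r->server          = le32(b + 24);
    r->port            = le32(b + 28);
    r->path            = le32(b + 32);
    r->query           = le32(b + 36);
    r->size            = le32(b + 40);
    r->status          = le16(b + 44);
    r->type            = b[46];
    r->flags           = b[47];
    r->method          = le32(b + 48);
    r->protocol        = le32(b + 52);
}
```

A simulator would typically compare server_duration against NO_SERVER_CONNECTION before using it as a duration, since the sentinel would otherwise read as a connection lasting more than an hour.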

The time fields time_sec and time_usec together give the time at which the proxy accepted a connection from the client, measured from the UNIX epoch with microsecond resolution.
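For simulation it is often convenient to fold the two time fields into a single value. The sketch below assumes, from the field names only, that time_sec holds whole seconds since the epoch and time_usec holds the remaining microseconds; the function name request_time_usec is hypothetical.

```c
#include <stdint.h>

/* Combine the two 32-bit time fields into microseconds since the epoch.
 * A 64-bit result is required: a 32-bit count of microseconds would
 * overflow after roughly 71 minutes.
 * ASSUMPTION: time_sec is whole seconds, time_usec the sub-second part. */
uint64_t request_time_usec(uint32_t time_sec, uint32_t time_usec)
{
    return (uint64_t)time_sec * 1000000u + time_usec;
}
```

Adding event_duration to this value then gives the time at which the proxy closed the client connection, on the same scale.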

The fields client, server, path, and query are all unique ID numbers. These numbers are sequential from 1 to the last unique value for that field.

The "path" is the portion of the requested URL following the server name (and possible port number), up to the end of the URL or the first ?, if one appears. If the URL contains no path, a default path of / is used and a flag is set (see below).

If a ? appears in the URL, the string used for the query is everything after (and including) the first ? in the URL.
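The path and query rules above amount to splitting at the first ?. The following sketch illustrates that rule only; it is not taken from the post-processing tools, and the function name split_path_query is hypothetical. Its input is the part of the URL that follows the server name and optional port.

```c
#include <string.h>

/* Split the part of a URL that follows the server name (and port) into
 * path and query. The input string is not modified. *path_len receives
 * the length of the path, which runs from rest[0] up to the first '?'
 * or the end of the string. The return value points at the query,
 * including its leading '?', or is NULL when there is no query. */
const char *split_path_query(const char *rest, size_t *path_len)
{
    const char *q = strchr(rest, '?');
    *path_len = (q != NULL) ? (size_t)(q - rest) : strlen(rest);
    return q;
}
```

For the input /a/b.html?x=1?y the path is the first 9 characters (/a/b.html) and the query is ?x=1?y, since only the first ? separates the two. A path length of 0 corresponds to the case where the URL had no path and the default / is substituted.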

The port is the TCP port used for the connection.

The size is the size of the object in bytes.

Status is the HTTP status code returned from the web server.

The values for the field flags are as follows:

Symbol	Value	Meaning
NO_PATH_FLAG	1	path consists of only `/`
PORT_SPECIFIED_FLAG	2	A specific TCP Port was specified
NULL_PATH_ADDED_FLAG	4	The requested URL had no path
QUERY_FOUND_FLAG	8	A `?` was found in the requested URL
EXTENSION_SPECIFIED_FLAG	16	The path field specified an extension
CGI_BIN_FLAG	32	The string `cgi_bin` was found in the path
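Since each flag occupies its own bit, several can be set in the same record and each is tested with a bitwise AND. The sketch below uses the symbol values from the table; the helper flag_names and its output format are hypothetical conveniences, not part of proxytrace2txt.

```c
#include <stdio.h>
#include <string.h>

#define NO_PATH_FLAG              1
#define PORT_SPECIFIED_FLAG       2
#define NULL_PATH_ADDED_FLAG      4
#define QUERY_FOUND_FLAG          8
#define EXTENSION_SPECIFIED_FLAG 16
#define CGI_BIN_FLAG             32

/* Write the names of the set bits into out, separated by '|'.
 * Returns out; an empty string means no flags were set. Output is
 * cut short, but still NUL-terminated, if the buffer is too small. */
char *flag_names(unsigned char flags, char *out, size_t out_size)
{
    static const struct { unsigned char bit; const char *name; } table[] = {
        { NO_PATH_FLAG,             "NO_PATH" },
        { PORT_SPECIFIED_FLAG,      "PORT_SPECIFIED" },
        { NULL_PATH_ADDED_FLAG,     "NULL_PATH_ADDED" },
        { QUERY_FOUND_FLAG,         "QUERY_FOUND" },
        { EXTENSION_SPECIFIED_FLAG, "EXTENSION_SPECIFIED" },
        { CGI_BIN_FLAG,             "CGI_BIN" },
    };
    size_t used = 0;

    if (out_size == 0)
        return out;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (!(flags & table[i].bit))
            continue;
        int n = snprintf(out + used, out_size - used, "%s%s",
                         used ? "|" : "", table[i].name);
        if (n < 0 || (size_t)n >= out_size - used)
            break;  /* buffer full: stop appending */
        used += (size_t)n;
    }
    return out;
}
```

A flags value of 24, for instance, has both QUERY_FOUND_FLAG (8) and EXTENSION_SPECIFIED_FLAG (16) set.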

Any extension provided at the end of the path is used to determine an object type. The type field is set according to the extension string. Various forms of well-known extensions are accepted (e.g. .htm is considered type HTML, and .JPEG, .jpeg, .JPG, and .jpg are all considered the same).

Symbol	Value
NO_EXTENSION	0
HTML_EXTENSION	1
GIF_EXTENSION	2
CGI_EXTENSION	3
DATA_EXTENSION	4
CLASS_EXTENSION	5
MAP_EXTENSION	6
JPEG_EXTENSION	7
MPEG_EXTENSION	8
OTHER_EXTENSION	9
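The classification can be pictured as a case-insensitive lookup on the text after the last dot in the final path component. The sketch below is a guess at that logic, not the actual post-processing code. Only the .htm and .jpeg/.jpg equivalences come from this document; every other table entry is an assumption inferred from the symbol names, and DATA_EXTENSION is left unmapped because its extensions are not stated here. The names classify_path and equal_ignoring_case are hypothetical.

```c
#include <ctype.h>
#include <string.h>

enum {
    NO_EXTENSION = 0, HTML_EXTENSION, GIF_EXTENSION, CGI_EXTENSION,
    DATA_EXTENSION, CLASS_EXTENSION, MAP_EXTENSION, JPEG_EXTENSION,
    MPEG_EXTENSION, OTHER_EXTENSION
};

/* Portable case-insensitive string equality. */
static int equal_ignoring_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;  /* equal only if both strings ended together */
}

/* Map a path to an object type code based on its extension. */
int classify_path(const char *path)
{
    /* ASSUMPTION: apart from htm and jpeg/jpg, these pairings are
     * guesses based on the symbol names in the table above. */
    static const struct { const char *ext; int type; } known[] = {
        { "html", HTML_EXTENSION },  { "htm", HTML_EXTENSION },
        { "jpeg", JPEG_EXTENSION },  { "jpg", JPEG_EXTENSION },
        { "gif",  GIF_EXTENSION },   { "cgi", CGI_EXTENSION },
        { "class", CLASS_EXTENSION }, { "map", MAP_EXTENSION },
        { "mpeg", MPEG_EXTENSION },  { "mpg", MPEG_EXTENSION },
    };
    /* Only look for a dot in the last path component. */
    const char *slash = strrchr(path, '/');
    const char *dot = strrchr(slash ? slash : path, '.');

    if (dot == NULL || dot[1] == '\0')
        return NO_EXTENSION;
    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++)
        if (equal_ignoring_case(dot + 1, known[i].ext))
            return known[i].type;
    return OTHER_EXTENSION;
}
```

Restricting the search to the last component means a path such as /dir.v2/readme is treated as having no extension rather than the extension v2/readme.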

Method and protocol are copied directly from Squid's source code, and are as follows:

    typedef enum {
	METHOD_NONE = 0,
	METHOD_GET,
	METHOD_POST,
	METHOD_HEAD,
	METHOD_CONNECT
    } method_t;
    
    typedef enum {
	PROTO_NONE = 0,
	PROTO_HTTP,
	PROTO_FTP,
	PROTO_GOPHER,
	PROTO_WAIS,
	PROTO_CACHEOBJ,
	PROTO_MAX
    } protocol_t;
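When printing records it helps to turn these codes back into names. The sketch below repeats the two enums so that it compiles on its own; the functions method_name and protocol_name, and the "UNKNOWN" fallback for out-of-range values, are hypothetical additions rather than part of Squid or proxytrace2txt.

```c
typedef enum {
    METHOD_NONE = 0, METHOD_GET, METHOD_POST, METHOD_HEAD, METHOD_CONNECT
} method_t;

typedef enum {
    PROTO_NONE = 0, PROTO_HTTP, PROTO_FTP, PROTO_GOPHER, PROTO_WAIS,
    PROTO_CACHEOBJ, PROTO_MAX
} protocol_t;

/* Name for a method code; the array is indexed by the enum value. */
const char *method_name(unsigned value)
{
    static const char *const names[] = {
        "NONE", "GET", "POST", "HEAD", "CONNECT"
    };
    return value < sizeof names / sizeof names[0] ? names[value] : "UNKNOWN";
}

/* Name for a protocol code. PROTO_MAX is a count, not a protocol, so it
 * and anything above it fall through to "UNKNOWN". */
const char *protocol_name(unsigned value)
{
    static const char *const names[] = {
        "NONE", "HTTP", "FTP", "GOPHER", "WAIS", "CACHEOBJ"
    };
    return value < sizeof names / sizeof names[0] ? names[value] : "UNKNOWN";
}
```

Taking the code as an unsigned value rather than as the enum type lets the functions accept whatever integer was read from the trace, including values outside the enumerated range.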

Code to Read Traces

The proxytrace2txt program is a simple C program (with an associated header file) that opens either a regular or compressed (file with extension ".gz") trace data file, and prints out the trace records as seen. A packaged copy of the code for this program is provided as a tool to examine the trace data, and as a template from which to develop simulation programs that use this data.

This page and the documented programs were done by Tom M. Kroeger as part of a summer internship at Digital's Western Research Lab, under the supervision of Jeff Mogul. The instrumentation of the squid proxy was done by Carlos Maltzahn.