The key things you need to know to work with these traces are:
proxytrace2txt, as a tool to convert the trace files into human-readable form, and as a template for developing simulation programs that use these traces.
The rest of this document is organized as follows:
We modified the 1.0.beta17 version of the Squid proxy to output specific details of its actions at various points. This instrumented proxy software was then run on two machines that act as web proxies for Digital's internal network. Because the DNS entry for the proxy was set up so that the DNS servers would randomly return either of the two addresses with equal probability, proxy clients split their requests between the two proxies. The trace stream represented in these files is a combination of the requests traced at both machines, and thus represents all external web-related requests generated from within Digital.
In post-processing the instrumented proxy output, the following steps were taken:
Entries in the instrumented proxy output (e.g. ###time_stamp### accept TCP connection from client xxx.xxx.xxx.xxx) were correlated to generate an intermediate format with one line for each client request.
Additionally, when these fields are transformed to unique IDs, a 128-bit fingerprint is used to identify each string (see Trace Fields and Values for more details on the strings representing these fields). Files mapping each fingerprint to its unique ID number are also available. These files allow correlation between different trace sets without compromising the privacy of the traced site's references.
The ASCII text header for each file details the output from post-processing. Each header contains the maximum value that each of the unique IDs has reached in this trace file. This number is effectively a count of the distinct values seen for that field in the traced stream up to and including this file. For example, if the client field has a maximum of 100, then by the end of this trace file the entire traced stream has seen requests from 100 different clients.
The structure of each trace record is shown in the type definition below. The type u_4bytes is an unsigned four byte integer.
    typedef struct _TEntry {
        u_4bytes event_duration;
        u_4bytes server_duration;
        u_4bytes last_mod;
        u_4bytes time_sec;
        u_4bytes time_usec;
        u_4bytes client;
        u_4bytes server;
        u_4bytes port;
        u_4bytes path;
        u_4bytes query;
        u_4bytes size;
        unsigned short status;
        unsigned char type;
        unsigned char flags;
        method_t method;
        protocol_t protocol;
    } TEntry;
The duration fields are measured in microseconds. The event_duration field is the time from when the proxy first accepts a connection from the client to when the proxy successfully closes that connection. The server_duration field is the time the proxy is connected to the web server. Note that server_duration takes the unsigned equivalent of -1 when no connection to the server was made.
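The sentinel value described above is easy to miss when computing averages, so a small sketch of how it might be handled follows. The names NO_SERVER_CONNECTION, has_server_duration, and duration_seconds are hypothetical helpers, and uint32_t is assumed to stand in for u_4bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: uint32_t stands in for the unsigned four byte type. */
typedef uint32_t u_4bytes;

/* Unsigned equivalent of -1: no connection to the server was made. */
#define NO_SERVER_CONNECTION ((u_4bytes)-1)

/* True when the record carries a real server duration. */
static bool has_server_duration(u_4bytes server_duration)
{
    return server_duration != NO_SERVER_CONNECTION;
}

/* Convert a duration field from microseconds to seconds. */
static double duration_seconds(u_4bytes duration_usec)
{
    return (double)duration_usec / 1e6;
}
```

A simulation would typically call has_server_duration before including a record in any server latency statistic, since the sentinel would otherwise look like a connection lasting more than an hour.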
The time fields represent the time at which the proxy accepted a connection from the client, in microseconds since the UNIX epoch.
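As an illustration, the two time fields can be folded into a single 64-bit timestamp. This sketch assumes, based only on the field names, that time_sec holds whole seconds since the UNIX epoch and time_usec holds the microsecond remainder; timestamp_usec is a hypothetical helper name.

```c
#include <stdint.h>

/* Assumption: time_sec is whole seconds since the UNIX epoch and
 * time_usec is the microsecond part within that second. */
static uint64_t timestamp_usec(uint32_t time_sec, uint32_t time_usec)
{
    /* Widen before multiplying so the product cannot overflow 32 bits. */
    return (uint64_t)time_sec * 1000000u + time_usec;
}
```

With a single 64-bit value per record, the interval between two accepted connections is a plain subtraction.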
The fields client, server, path, and query are all unique ID numbers. These numbers are sequential from 1 to the last unique value for that field.
The "path" is the portion of the requested URL following the server name (and possible port number), up to the end of the URL or the first ?, if one appears. If the URL contains no path, a default path of / is used and a flag is set (see below). If a ? appears in the URL, the string used for the query is everything after (and including) the first ? in the URL.
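The splitting rule can be sketched as follows. This is not the code used to produce the traces; split_path_query is a hypothetical function that takes the part of the URL after the server name and port, and its return value mimics the "no path" condition that sets a flag.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Split the part of a URL that follows the server name (and port)
 * into path and query. Returns 1 when the URL had no path and the
 * default "/" was supplied, 0 otherwise. Output is truncated to fit. */
static int split_path_query(const char *rest,
                            char *path, size_t path_cap,
                            char *query, size_t query_cap)
{
    const char *qmark = strchr(rest, '?');   /* first '?', if any */
    size_t path_len = qmark ? (size_t)(qmark - rest) : strlen(rest);
    int path_was_empty = (path_len == 0);

    if (path_was_empty)
        snprintf(path, path_cap, "/");
    else
        snprintf(path, path_cap, "%.*s", (int)path_len, rest);

    /* The query keeps its leading '?'; it is empty when none appears. */
    snprintf(query, query_cap, "%s", qmark ? qmark : "");
    return path_was_empty;
}
```

For example, "/a/b.html?x=1" yields the path "/a/b.html" and the query "?x=1", while an empty remainder yields the default path "/" and an empty query.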
The port is the TCP port used for the connection.
The size is the size of the object in bytes.
Status is the HTTP status code returned from the web server.
The values for the field flags are as follows:
Symbol | Value | Meaning |
---|---|---|
NO_PATH_FLAG | 1 | path consists of only / |
PORT_SPECIFIED_FLAG | 2 | A specific TCP Port was specified |
NULL_PATH_ADDED_FLAG | 4 | The requested URL had no path |
QUERY_FOUND_FLAG | 8 | A ? was found in the requested URL |
EXTENSION_SPECIFIED_FLAG | 16 | The path field specified an extension |
CGI_BIN_FLAG | 32 | The string cgi_bin was found in the path |
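Because the flag values are distinct powers of two, the flags field can be read as a bitmask. The sketch below defines the symbols with the values from the table and decodes a flags byte into a space-separated list of names; describe_flags and FLAG_NAMES are hypothetical names, not part of the distributed code.

```c
#include <stddef.h>
#include <string.h>

#define NO_PATH_FLAG             1
#define PORT_SPECIFIED_FLAG      2
#define NULL_PATH_ADDED_FLAG     4
#define QUERY_FOUND_FLAG         8
#define EXTENSION_SPECIFIED_FLAG 16
#define CGI_BIN_FLAG             32

static const struct { unsigned char bit; const char *name; } FLAG_NAMES[] = {
    { NO_PATH_FLAG,             "NO_PATH_FLAG" },
    { PORT_SPECIFIED_FLAG,      "PORT_SPECIFIED_FLAG" },
    { NULL_PATH_ADDED_FLAG,     "NULL_PATH_ADDED_FLAG" },
    { QUERY_FOUND_FLAG,         "QUERY_FOUND_FLAG" },
    { EXTENSION_SPECIFIED_FLAG, "EXTENSION_SPECIFIED_FLAG" },
    { CGI_BIN_FLAG,             "CGI_BIN_FLAG" },
};

/* Write the names of all set flags into out, separated by spaces.
 * Returns the number of names written; stops early if out is full. */
static size_t describe_flags(unsigned char flags, char *out, size_t cap)
{
    size_t used = 0, count = 0;
    if (cap > 0)
        out[0] = '\0';
    for (size_t i = 0; i < sizeof FLAG_NAMES / sizeof FLAG_NAMES[0]; i++) {
        if (!(flags & FLAG_NAMES[i].bit))
            continue;
        size_t len = strlen(FLAG_NAMES[i].name);
        if (used + len + (count ? 1 : 0) >= cap)
            break;                           /* no room left */
        if (count)
            out[used++] = ' ';
        memcpy(out + used, FLAG_NAMES[i].name, len + 1);
        used += len;
        count++;
    }
    return count;
}
```

A single test such as (flags & QUERY_FOUND_FLAG) is enough when only one condition matters, for instance to skip requests that carried a query string.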
Any extension provided at the end of the path is used to determine an object type. The type field is set depending on what string the extension is. Various forms of well-known extensions are accepted (e.g. .htm is considered type HTML, and .JPEG, .jpeg, .JPG, and .jpg are all considered the same).
Symbol | Value |
---|---|
NO_EXTENSION | 0 |
HTML_EXTENSION | 1 |
GIF_EXTENSION | 2 |
CGI_EXTENSION | 3 |
DATA_EXTENSION | 4 |
CLASS_EXTENSION | 5 |
MAP_EXTENSION | 6 |
JPEG_EXTENSION | 7 |
MPEG_EXTENSION | 8 |
OTHER_EXTENSION | 9 |
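A case-insensitive classifier along these lines might look like the sketch below. Only the .htm and JPEG spellings are taken from the text above; the other extension strings (html, gif, cgi, class, map, mpeg, mpg) are guesses based on the symbol names, DATA_EXTENSION is left unmapped because its extensions are not stated, and mapping any unrecognized extension to OTHER_EXTENSION is likewise an assumption. The function names are hypothetical.

```c
#include <ctype.h>
#include <string.h>

enum {
    NO_EXTENSION = 0, HTML_EXTENSION, GIF_EXTENSION, CGI_EXTENSION,
    DATA_EXTENSION, CLASS_EXTENSION, MAP_EXTENSION, JPEG_EXTENSION,
    MPEG_EXTENSION, OTHER_EXTENSION
};

/* Compare two strings, ignoring ASCII case. */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Classify a path by the extension of its last component. */
static unsigned char classify_path(const char *path)
{
    const char *slash = strrchr(path, '/');
    const char *name = slash ? slash + 1 : path;
    const char *dot = strrchr(name, '.');

    if (dot == NULL || dot[1] == '\0')
        return NO_EXTENSION;

    const char *ext = dot + 1;
    if (equals_ignore_case(ext, "htm") || equals_ignore_case(ext, "html"))
        return HTML_EXTENSION;
    if (equals_ignore_case(ext, "jpeg") || equals_ignore_case(ext, "jpg"))
        return JPEG_EXTENSION;
    /* Assumed spellings, inferred from the symbol names only. */
    if (equals_ignore_case(ext, "gif"))   return GIF_EXTENSION;
    if (equals_ignore_case(ext, "cgi"))   return CGI_EXTENSION;
    if (equals_ignore_case(ext, "class")) return CLASS_EXTENSION;
    if (equals_ignore_case(ext, "map"))   return MAP_EXTENSION;
    if (equals_ignore_case(ext, "mpeg") || equals_ignore_case(ext, "mpg"))
        return MPEG_EXTENSION;
    return OTHER_EXTENSION;
}
```

Looking for the dot only in the last path component keeps a directory name such as /dir.d/file from being mistaken for an extension.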
Method and protocol are copied directly from Squid's source code, and are as follows:
    typedef enum {
        METHOD_NONE = 0,
        METHOD_GET,
        METHOD_POST,
        METHOD_HEAD,
        METHOD_CONNECT
    } method_t;

    typedef enum {
        PROTO_NONE = 0,
        PROTO_HTTP,
        PROTO_FTP,
        PROTO_GOPHER,
        PROTO_WAIS,
        PROTO_CACHEOBJ,
        PROTO_MAX
    } protocol_t;
The proxytrace2txt program is a simple C program (with an associated header file) that opens either a regular or compressed (file with extension ".gz") trace data file, and prints out the trace records as seen. A packaged copy of the code for this program is provided as a tool to examine the trace data, and as a template from which to develop simulation programs that use this data.
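To give a feel for what a simulation loop built on the record layout might look like, here is a minimal sketch. It is not the proxytrace2txt source. It assumes an uncompressed stream already positioned past the ASCII header, and that the on-disk records match the in-memory TEntry layout and byte order of the machine reading them; neither assumption is confirmed by this document. TraceTotals and sum_trace are hypothetical names, and uint32_t stands in for u_4bytes.

```c
#include <stdint.h>
#include <stdio.h>

typedef uint32_t u_4bytes;   /* assumption: unsigned four byte integer */

typedef enum { METHOD_NONE = 0, METHOD_GET, METHOD_POST, METHOD_HEAD,
               METHOD_CONNECT } method_t;
typedef enum { PROTO_NONE = 0, PROTO_HTTP, PROTO_FTP, PROTO_GOPHER,
               PROTO_WAIS, PROTO_CACHEOBJ, PROTO_MAX } protocol_t;

typedef struct _TEntry {
    u_4bytes event_duration, server_duration, last_mod;
    u_4bytes time_sec, time_usec;
    u_4bytes client, server, port, path, query, size;
    unsigned short status;
    unsigned char type, flags;
    method_t method;
    protocol_t protocol;
} TEntry;

typedef struct {
    unsigned long records;      /* number of records read */
    unsigned long long bytes;   /* sum of the size fields */
} TraceTotals;

/* Read records until end of stream, accumulating simple totals.
 * The stream must already be positioned at the first record. */
static TraceTotals sum_trace(FILE *in)
{
    TraceTotals totals = { 0, 0 };
    TEntry entry;

    while (fread(&entry, sizeof entry, 1, in) == 1) {
        totals.records++;
        totals.bytes += entry.size;
    }
    return totals;
}
```

A real simulator would replace the body of the loop with its own cache or network model, and would need to handle the header and compressed files the way the packaged program does.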