Systems Research Center
130 Lytton Avenue
Palo Alto, CA 94301
http://www.research.digital.com/SRC/
In this paper we introduce a programming language for Web document processing called WebL. WebL is a high level, object-oriented scripting language that incorporates two novel features: service combinators and a markup algebra. Service combinators are language constructs that provide reliable access to web services by mimicking a web surfer's behavior when a failure occurs while retrieving a page. The markup algebra extracts structured and unstructured values from pages for computation, and is based on algebraic operations on sets of markup elements. WebL is used to quickly build and experiment with custom web crawlers, meta-search engines, page transducers, shopping robots, etc.
The input phase involves fetching one or more web pages for processing. During this phase we have to contend with the web's geographic distribution and architectural inefficiencies. For example, one or more of the following situations might apply when retrieving a page from a web service:
The processing phase of a typical web computation involves extracting data values from pages and performing computations on these data values. We assume pages to be marked-up in either XML or HTML, so as to exploit the structural content of the page. Our data extraction technique is based on a markup algebra that performs operations on sets of elements in a page (see section 4).
The output phase of a typical web computation covers the generation of web documents from values computed during the processing phase, and storing them back on the web (for example, by publishing the page on a web server).
Figure 1 depicts this general model of a web computation. Web pages flow through a pipeline of service combinators for fetching pages, a markup parser, the markup algebra for extracting (or "searching" for) data values from (on) a page, computing on those values, and page manipulation. Searching, computing and manipulation is repeated as often as needed. Finally the page is regenerated from its internal representation by the markup generator, and stored back on the web.
Some of the applications we constructed with WebL so far include:
In the remainder of the paper we will concentrate on the two novel aspects of WebL, namely service combinators (section 3) and the markup algebra (section 4). These sections are followed with related work (section 5) and conclusions (section 6). An appendix lists example programs.
For the remainder of this section S and T to denote operands (called services), which may contain primitives to fetch pages or general WebL computations.
Services
The getpage function fetches with the HTTP GET protocol the resource associated with the string URL. The result returned is a page object that encapsulates the resource. The function fails if the fetch fails. The second and third arguments to getpage are optional when specified, they provide the server with query arguments and HTTP headers respectively. A similar function called postpage uses the HTTP POST protocol, used to fill in web-based input forms.
// This program looks up the word "java" on the
// AltaVista search engine.
page := getpage("http://www.altavista.digital.com/cgi-bin/query",
[. pg="q", what="web", q="java" .])
WebL's data extraction language addresses these problems with the notion of a markup algebra. The markup algebra is based upon the concepts of pieces, piece-sets and algebraic operators that are applied to piece-sets.
First, we define a piece as a contiguous text region in a document, identified by the starting and the ending position of the region. For this paper we can imagine positions as indices that indicate a character offset in the page, which makes it easy to determine by numerical comparison the relationship between two regions, such as whether two pieces overlap, are contained in each other, or follow each other. The length of a piece is defined as the difference between the starting and ending position. (Our actual WebL implementation uses a more complicated data structure for pieces that simplifies searching and page modification.) We further define a piece-set as a collection of pieces. Pieces within piece-sets may overlap, be nested, or may belong to different pages. However, unlike mathematical sets that do not impose a particular ordering on their elements, piece-sets are always in a canonical representation in which pieces are ordered accordingly to their starting position, and then their ending position in the document. This allows iterating over pieces in a set in the sequence they appear in the document, and also to pick the n'th occurrence of a pattern (by indexing into the piece-set). Both pieces and piece-sets are mapped to special objects in WebL, which means that they can have attributes and be manipulated by program.
A common way to create a piece-set is to search for all the HTML or XML elements with a specific name (we call this a structured search). For example, the following program returns all the anchors (hyperlinks) that occur on the DIGITAL homepage by calling a method called Elem of the page object P:
Another way to create a piece-set is to search for character patterns, ignoring the markup (we call this unstructured search or pattern search). The Pat method of a page object extracts all the occurrences of a Perl 5-style regular expression [Fri97] in the text of a page. The following example extracts the occurrences of the word "Digital" or "digital" in the Digital home page.
Finally, we define a set operator S ¤ T as an algebraic operation ¤ between two piece-sets S and T that returns a third piece-set as a result. For the remainder of this section, S and T denote piece-sets, the elements of S and T are referred to as s and t, and P stands for a page object. WebL divides set operators into groups of basic set manipulation operators, positional set operators, and hierarchical set operators, which will be discussed in the following sections. In the interest of conciseness, we will not describe the negated operators (those starting with an exclamation point), as their behavior is easy to deduce.
Union | S + T |
Intersection | S * T |
Exclusion | S - T |
Basic set operators are used for basic set manipulation. They contain a set union operator, a set intersection operator, and a set exclusion operator. The set union operator merges the two sets S and T and eliminates duplicate pieces. The set intersection operator returns the set of all pieces that are contained both in S and T, and the set exclusion operator calculates the set of pieces that are contained in S but not in T. As an example, the following program retrieves all the level one and level two headings in a page:
S before T | S !before T |
S after T | S !after T |
S directlybefore T | S !directlybefore T |
S directlyafter T | S !directlyafter T |
S overlap T | S !overlap T |
Positional operators provide functionality to query on the locality property of pieces, such as searching for pieces that are located above or below other pieces in the linear text flow of the document.
The before operator computes the set of pieces in S that are located before some piece in T. We define a piece s to be located before a piece t, if the ending position of s precedes the starting position of t. Correspondingly, the after operator returns the set of the pieces in S that are located after some piece in T. Although being very effective, these two operators are not always sufficient. As an example, in some cases we might not be interested in all the occurrences of a link after a special keyword, but only in the very first occurrence of a link after the special keyword. In this case, we use the stronger operators directlybefore and directlyafter that return the set of only the closest pieces in S that follow or precede some piece in T. We also call the latter non-transitive versions of the before and after operator. The following example depicts the differences between these operators:
S in T | S !in T |
S contain T | S !contain T |
S directlyin T | S !directlyin T |
S directlycontain T | S !directlycontain T |
In contrast to positional operators that provide functionality to express locality relationships between pieces, hierarchical operators provide functionality to express containment and inclusion relationships between piece.
The in operator returns the set of pieces in S that are contained in some piece in T. We define a piece s to be contained in a piece t, if the starting position of s follows or is equivalent to the starting position of t, and the length of s is smaller or equal than the length of t. Equivalently, the contain operator returns the set of pieces in S that contain some piece in T. As an example, to search for all the rows in the third table of a page, we write,
rows := P.Elem("TR") in P.Elem("TABLE")[2]
and to search for all the level two headings that mention the word UCI we write
titles := P.Elem("H2") contain P.Pat("UCI")
As well as for positional operators, we define two stronger, non-transitive operators directlyin and directlycontain that address direct containment and direct inclusion properties. They return the set of only the first pieces in S that contain or are contained in some piece in T. The following example depicts the differences:
// retrieves all the subsections -> {LI1, LI2,
LI3, LI4, LI5, LI6}
subsections := P.Elem("LI") in P.Elem("UL")[0]
However, in many cases we are not interested in nested lists and would only like to retrieve the list items of the top-level list. Therefore we use the directlyin operator and write:
// retrieve only the toplevel subsections ->
{LI1, LI2, LI3, LI6}
subsections := P.Elem("LI") directlyin P.Elem("UL")[0]
In practice, the most widely employed technique for searching in text documents is pattern matching using regular expressions [Fri97, IEEE92]. Regarding structured text search, the limitations of regular expressions are twofold: they completely lack information about the structure of the document and they apply a "leftmost longest match" rule which is often inappropriate for nested data structures. Searching for a table, for example, only returns a correct match if there is only one table in the document. A discussion of this problem is found in [CC97].
Several improved approaches to extracting information from semi-structured text documents have recently been proposed. The most prominent techniques are based on tree matching, grammar parsing, and set algebras.
In tree matching, the search problem is reduced to searching a subtree (i.e. pattern) in a parse-tree (i.e. view). The main disadvantage of tree matching is the lack of orthogonality and compositionality regarding different views (i.e. different parse trees). Queries that search for character patterns cannot be mixed with searches for special structures in the document. In addition, many of the tree matching problems cannot be solved in linear time, but have polynomial runtime. Some problems (such as unordered path inclusion) are even NP-complete [Kil92]. Several recent programming and searching languages are based on tree matching, among them the programming language Turquois [MM97].
Context free grammars pursue an approach, in which the search pattern is specified as a context free grammar [ST96]. The result of a search query are all the substrings in the document that are accepted by the specified grammar. On the one hand, context free grammars are very expressive in that they allow the definition of recursive search queries. On the other hand, they suffer from the same problems as tree matching: they do not allow expressing view-spanning and overlapping queries and require polynomial runtime.
Lately, several new techniques have been published that are based on a set algebra [ST92, JK95, CCB94]. The Standard Document Query Language (SDQL) of the Document Style Semantics and Specification Language or DSSSL [DSSSL96] introduces the concept of nodes and node-lists, which are loosely related to our pieces and piece-lists. Some of the WebL operators are provided and the user can also program new ones in a Lisp-like language. The data structure SDQL operates on called a grove is essentially a tree of nodes corresponding to elements in the document, and thus multiple views and overlapping elements cannot be modelled. PAT expressions [ST92] use a set-at-a-time algebra for manipulating sets of match-points and sets of regions. In contrast to the WebL search algebra, PAT expressions do not support an orthogonal and unified model. Sets of match-points and sets of regions cannot be arbitrarily composed and, in regard to document transformation, match points are not very practical since only the starting position of a match is recorded. However, most of these problems can be avoided. Clarke, Cormack, and Burkowski propose a compositional structured text algebra that is based on the notion of sets and ranges [CCB94]. Apart from the WebL set-algebra, this is the only other approach that supports overlappings between views. Unfortunately, the idea has not completely been taken to the end. Although the model supports overlappings the language does not (remember that WebL has an explicit overlap operator). Additionally, nestings are avoided by selecting the minimal segments from those set elements that nest. Concerning runtime complexity, all of the set algebra problems can be solved in linear time if no two elements in a set overlap [NY96]. In the worst case, if all the elements in the set overlap with each other, the runtime complexity is quadratic in the number of elements in the set. Considering the unlikelyness of such an event and the importance of overlappings, this is a price that we are willing to pay in WebL.
In contrast to the above high-level search languages, their are also efforts to specify low-level programming API's that provides users with the functionality for document navigation and manipulation, such as navigating through the document parse tree, or modifying HTML and XML elements. The most prominent activity in this area is W3C's document object model [DOM97]. In contrast to WebL, DOM is restricted to manipulating and searching single HTML and XML elements, it does not provide a notion of character patterns, does not support multiple overlapping views, and inherently cannot perform computation.
There are also several recent proposals for automating tasks on the web. The Web Interface Definition Language or WIDL [MA97] enables automation by mapping web content into program variables using declarative descriptions of resources. WIDL provides features to submit queries and to extract features from the resulting pages. WIDL does not determine itself how search is to be done, but rather uses the Java Page Object Model [JS] or the Document Object Model [DOM97]. Page manipulation is not supported. WebSQL [AMM97] is a declarative query language for extracting information from the web. The language emphasis is on extracting connectivity information from pages (for example to locate pages that are two hops away from a specific page). WebSQL regards HTML documents as monolithic objects, and therefore its analyses are limited to simple text matching techniques. The Internet Fish Construction Kit (IFISH) is a tool to build dynamic information gatherers on the web [LaM97]. Internet Fish use "info-chunks", possibly extracted from web pages, or created by other independent fish, to place new info-chunks on a shared black-board. The basic idea is that many fishes specialized for specific tasks (for example looking for telephone numbers in a page) make it easier to extract information from web pages that continually change. IFISH is mainly concerned with the fish control structure and not so much with the page fetching and data extracting steps.
[AMM97] | Gustavo O. Arocena, Alberto O. Mendelzohn, and George A. Mihaila. Applications
of a Web Query Language. Proceedings of WWW6, 1997, Santa Clara, California.
http://atlanta.cs.nchu.ed u.tw/www/PAPER267.html |
[Car94] | Luca Cardelli. Obliq: A language with distributed scope. Research
Report 122, Digital Equipment Corporation Systems Research Center, Palo
Alto, California. June 1994.
ftp://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-12 2.html |
[CC97] | C. L. A. Clarke and G. B. Cormack. On the Use of Regular Expressions for Searching Text. ACM Transactions on Programming Languages and Systems, 19(3), pp 413-426. March 1997. |
[CCB94] | C. L. A. Clarke, G. V. Cormack, and F. J. Burkowski. An Algebra for Structured Text Search and a Framework for its Implementation. Department of Computer Science, University of Waterloo, Canada, Technical Report CS-94-30. August 1994. |
[CD97] | Luca Cardelli and Rowan Davies. Service combinators for Web Computing.
Research Report 148, Digital Equipment Corporation Systems Research Center,
Palo Alto, California. June 1997.
ftp://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-14 8.html |
[CDF97] | Channel Definition Format (CDF). Published by Microsoft Corp. September,
1997.
http://www.microsoft.com/stand ards/cdf.htm |
[DOM97] | W3C DOM working group. Document Object Model Specification.
October 1997.
http://www.w3.org/TR/WD-DOM/ |
[DSSL96] | Document Style Semantics and Specification Language (DSSSL),
ISO/IEC 10179:1996.
http://www.jclark.com/dsssl/ |
[Fri97] | Jeffrey E. F. Friedl. Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools (Nutshell Handbook). O'Reilly and Associates, 1997 |
[IEEE92] | IEEE 1992. Standard for information technology - Portable Operating System Interface (POSIX) - Part 2(Shell and utilities) - Section 2.8 (Regular expression notation). IEEE Std 1003.2, Institute of Electrical and Electronics Engineers, New York 1992. |
[JK95] | J. Jaakkola and P. Kilpelainen. SGREP. University of Helsinki,
Department of Computer Science, 1995.
http://www.cs.helsinki.fi/ ~jjaakkol/sgrep.html |
[JS] | Netscape Corp. JavaScript Guide.
http://developer.netscape.com/library/documentation/communicator/jsgu ide4/index.htm |
[Kil92] | P. Kilpelainen. Tree Matching Problems with Applications to Structured Text Databases. Ph. D. Dissertation, Department of Computer Science, University of Helsinki, Report A-19992-6, Helsinki, Finland. November 1992. |
[LaM97] | Brian A. LaMacchia. The Internet Fish Construction Kit. Proceedings
of WWW6, 1997, Santa Clara, California.
http://atlanta.cs.nchu.ed u.tw/www/PAPER138.html |
[MA97] | Phillip Merrick, Charles Allen. Web Interface Definition Language
(WIDL). Published by webMethods Inc. September 1997.
http://www.w3.org/TR/NOTE-widl |
[MM97] | R. C. Miller and B. A. Myers. Creating Dynamic World Wide Web Pages By Demonstration. Carnegie Mellon University School of Computer Science Tech Report CMU-CS-97-131. May 1997. |
[NY96] | G. Navarro and R. Baeza-Yates. A Class of Linear Algorithms to Process Sets of Segments. In Proceedings of PANEL'96, Volume 2, pp. 671-682, 1996 |
[ST92] | A. Salminen and F. W. Tompa. PAT expressions: an algebra for text search. Acta Linguistica Hungarica, Vol. 41 (1-4), pp. 277-306, 1992-93. |
[ST96] | A. Salminen and F. W. Tompa. Grammars++ for Modelling Information in Text. Department of Computer Science, University of Waterloo, Canada, Technical Report, CS-96-40. November 1996. |
[Wir82] | Niklaus Wirth. Programming in Modula-2 (Texts and Monographs in Computer Science). Springer Verlag, 1982. |
stockQuote := fun(symbol) page := getpage("http://fast.quote.com/fq/quotecom/quote", [. symbols=symbol .]); (page.Elem("B") in (page.Elem("TABLE") contain page.Pat("Stock Quotes"))[0])[1].Text() end; s := stockQuote("DEC")
shopAmazon := fun(title, authorfirst, authorlast) books := []; params := [. .]; params["author"] := authorfirst + " " + authorlast; params["author-mode"] := "full"; params["title"] := title; params["title-mode"] := "word"; params["subject"] := ""; params["subject-mode"] := "word"; page := postpage("http://www.amazon.com/exec/obidos/ats-query/", params); items := page.Elem("dd"); every book in items do info1 := substring(book.Text(), `\w*([^/]*) (/ ([^/]*))?(/ [^\d]*(\d*))?`)[0]; info2 := substring(book.Text(), `Our Price: \$(\d*.\d*)`); if (size(info2) > 0) and (info1[3] != "Audio Cassette") then books = books + [[. title = (page.Elem("a") directlybefore book)[0].Text(), link = (page.Elem("a") directlybefore book)[0].href, author = info1[1], type = info1[3], year = (select(info1[5],-2,-1) ? "N/A"), price = info2[0][1] .]] end end; books end