About WASD Searching
 *WASDquery

The WASD hypertext environment provides real-time searching of plain-text and HTML documents. The search is a simple string match, not a GREP-style search, although the search string may contain the wildcard characters "*", matching zero or more characters, and "%", matching any single character. (When no wildcard character begins or ends the search string, it behaves as if "*" characters were present there.) It is designed to provide a simple mechanism for locating documents containing a keyword, not for document analysis. The search string may contain spaces.
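
The wildcard rules above can be illustrated with a minimal sketch. It is not the WASD implementation; the function names are hypothetical, and both the case-insensitive default and the treatment of an undelimited string as a match anywhere in the line are assumptions made for the example.

```python
import re


def wildcard_to_regex(pattern, case_sensitive=False):
    """Translate a search string using "*" and "%" into a compiled regex.

    "*" matches zero or more characters, "%" matches exactly one.
    Every other character, including spaces, is matched literally.
    Case-insensitive matching is an assumption of this sketch.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "%":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile("".join(parts), flags)


def line_matches(pattern, line):
    """Return True if the search string occurs anywhere in the line.

    Using an unanchored search behaves as if "*" were present at both
    ends of the search string.
    """
    return wildcard_to_regex(pattern).search(line) is not None
```

For instance, line_matches("k%y", "the key") is true because "%" stands for the single character "e", while line_matches("k%y", "ky") is false because "%" must match exactly one character.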

Only files considered plain-text or HTML will be searched. Other files may be specified explicitly, or be selected by a wildcard file specification, but their contents will not actually be searched.

  Plain-Text Search

A search of a plain-text file is straightforward. Each line in the file is searched for the required string. The first occurrence on a line is considered a hit. The line is not searched for any further occurrences.
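
The one-hit-per-line rule can be sketched as follows. This is only an illustration under simplifying assumptions: the function name find_hits is hypothetical, and wildcard and case handling are left out so that the counting rule stands on its own.

```python
def find_hits(lines, needle):
    """Return the 1-based numbers of the lines that contain the needle.

    A line contributes at most one hit, however many times the string
    occurs in it, because the test stops at the first occurrence.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if needle in line:
            hits.append(number)
    return hits
```

With the lines "alpha beta", "beta beta" and "gamma", a search for "beta" yields hits on lines 1 and 2 only, even though the second line contains the string twice.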

Searches of plain-text files allow the subsequent selection of partial documents (i.e. the retrieval of only a number of lines around any actual hit). This allows the user to extract just a portion of a document, avoiding the need to scan through it to reach the section of interest.
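
Partial retrieval around a hit might look something like the sketch below. The function name extract_around and the default of five lines of context are assumptions for the example; the number of lines WASD actually returns is not stated here.

```python
def extract_around(lines, hit_line, context=5):
    """Return (first_line_number, extract) for the lines around a hit.

    hit_line is 1-based. Up to `context` lines before and after the hit
    are included, clipped at the start and end of the document.
    """
    start = max(1, hit_line - context)
    end = min(len(lines), hit_line + context)
    return start, lines[start - 1:end]
```

Returning the first line number along with the extract lets the caller show where in the document the extract begins.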

  HTML Search

A search of an HTML file is a little more complex. As might be expected, only text presented as part of the document is searched; markup text is ignored. That is, all text not part of an HTML tag construct is extracted and searched. For example, out of the following HTML fragment

  <!-- an example HTML document -->
  <P>
  The document entitled <A HREF="example.html">"Example Document"</A>
  provides only an <I>overview</I> of the full capabilities of HTML.

only the following text would actually be searched:

  The document entitled "Example Document" provides only an overview
  of the full capabilities of HTML.

The HTML character entities "&amp;", "&gt;", "&lt;", "&nbsp;", "&quot;" and "&#nnn;" are converted to the characters they represent before matching.
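
The extraction of searchable text from HTML can be sketched in a few lines. This is an illustration only, not the WASD code: the names visible_text and ENTITIES are hypothetical, "&nbsp;" is mapped to an ordinary space as an assumption, and white space is collapsed purely to make the result easy to compare.

```python
import re

# The named entities listed above; "&nbsp;" becomes a plain space here
# (an assumption of this sketch).
ENTITIES = {"amp": "&", "gt": ">", "lt": "<", "nbsp": " ", "quot": '"'}


def visible_text(html):
    """Return the text outside <...> constructs, with entities decoded."""
    out = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True          # start of a tag (or comment)
        elif ch == ">":
            in_tag = False         # end of the tag
        elif not in_tag:
            out.append(ch)         # ordinary document text
    text = "".join(out)

    def decode(match):
        name = match.group(1)
        if name.startswith("#"):
            return chr(int(name[1:]))   # numeric form, e.g. &#65;
        return ENTITIES[name]

    # Entities are decoded after the tags are stripped, so that "&lt;"
    # is never mistaken for the start of a tag.
    text = re.sub(r"&(#\d{1,3}|amp|gt|lt|nbsp|quot);", decode, text)
    return " ".join(text.split())
```

Applied to the example fragment above, this returns the single line: The document entitled "Example Document" provides only an overview of the full capabilities of HTML.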

The mechanism for partial document retrieval available with plain-text files is not present with HTML documents. HTML files are generally holistic documents (i.e. must be treated as a whole), with the formatting of a given section often very dependent on the formatting of previous sections. This makes extracting a subsection perilous without extensive syntactical analysis. On the positive side, HTML documents tend to be already divided into meaningful subdocuments (files), so a retrieved hit is naturally more or less within context.

  HTML problem, unbalanced <>

Occasionally HTML search results report:

  HTML problem, unbalanced <> in /whatever/path/to/file.html

This indicates that, on reaching end-of-file, the search engine had encountered a different number of "<" and ">" characters while parsing out the markup tags. The search results become a little suspect because it is then uncertain whether any given text was inside a markup tag or part of the document text. The document should be checked for markup errors.
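
A check of this kind could be approximated as below. It is a sketch under the assumption that a simple count of the two characters is enough to detect the problem; the function name unbalanced_report is hypothetical, and the search engine may well track the imbalance differently.

```python
def unbalanced_report(path, html):
    """Return the warning text if "<" and ">" counts differ, else None."""
    if html.count("<") != html.count(">"):
        return "HTML problem, unbalanced <> in " + path
    return None
```

A document ending in a truncated tag such as "</p" would trigger the report, whereas a stray "<" matched by a stray ">" elsewhere would not, which is one reason the count is only a rough indicator.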

  Search Statistics

Appended to the page of search results are some statistics on how many files were searched, how many were not (if any were non-text or non-HTML), the number of hits, and some server system statistics. This is an example:

 135 files searched (40 not) with 79 files hit, for a total of 323 hits.
 Elapsed: 00:06.95  CPU: 00:02.27  I/O: 467  Disk: 473  Records (lines): 55701


  Query v2.5;  May 1998