About Searching


The WASD hypertext environment provides real-time searching of plain-text and HTML documents. It is a simple-string search, not a GREP-style search. It is designed to provide a simple mechanism for locating documents containing a keyword, not for document analysis. The search string may contain spaces.

Note:

Searching is notoriously CPU- and I/O-intensive. To reduce the impact on interactive users of the server system, a long-running search has its scheduling priority lowered by one every 10 seconds, from normal down to zero. The search algorithm itself is efficient, but because of this mechanism a search takes longer on a more heavily loaded system.

Only files considered plain-text or HTML will be searched. Other files may be specified explicitly, or be selected by a wildcard file specification, but their contents will not be searched.
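
The priority reduction described in the note above amounts to simple arithmetic. The following is only a sketch of that arithmetic, not the server's actual code; the function name and the value 4 used for "normal" priority are assumptions made for illustration.

```python
def search_priority(elapsed_seconds, normal_priority=4, step_seconds=10):
    """Priority of a search after it has run for elapsed_seconds.

    normal_priority=4 is an assumed value for "normal"; the 10 second
    step and the floor of zero come from the note above.
    """
    steps = int(elapsed_seconds // step_seconds)  # completed 10 second intervals
    return max(normal_priority - steps, 0)        # never drops below zero


# Priority at a few points in a long search
schedule = [(t, search_priority(t)) for t in (0, 10, 25, 60)]
```

Under these assumptions a search reaches priority zero after 40 seconds and stays there.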

1 - Plain-Text Search

A search of a plain-text file is straightforward. Each line in the file is searched for the required string. The first occurrence on a line counts as a hit; the line is not searched for any further occurrences.
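
The line-by-line behaviour can be sketched as follows. This is an illustration only, not the server's implementation; the function name is hypothetical, and the case-insensitive default is an assumption, since this section does not say how case is treated.

```python
def search_plain_text(lines, needle, case_sensitive=False):
    """Return (line_number, line) for every line containing needle.

    case_sensitive=False is an assumed default, not documented behaviour.
    """
    if not case_sensitive:
        needle = needle.lower()
    hits = []
    for number, line in enumerate(lines, start=1):
        haystack = line if case_sensitive else line.lower()
        # A membership test is satisfied by the first occurrence, so a
        # line contributes at most one hit however often it matches.
        if needle in haystack:
            hits.append((number, line.rstrip("\n")))
    return hits


sample = [
    "the search string may contain spaces\n",
    "no match here\n",
    "search, search and search again\n",
]
hits = search_plain_text(sample, "search")
```

The third sample line contains the string three times but is reported once, giving two hits in total.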

Searches of plain-text files allow the subsequent selection of partial documents (i.e. the retrieval of only a number of lines around an actual hit). This lets the user extract just a portion of a document, avoiding the need to scan through to the section of interest.
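
Partial retrieval can be pictured as taking a window of lines around the hit. The sketch below assumes a symmetric window; the function name and the `context` parameter are hypothetical, and the number of lines the server actually returns is not stated here.

```python
def extract_around(lines, hit_line, context=2):
    """Return the lines within `context` lines of the 1-based hit_line."""
    start = max(hit_line - 1 - context, 0)     # clamp at the start of the file
    end = min(hit_line + context, len(lines))  # clamp at the end of the file
    return lines[start:end]


document = [f"line {n}" for n in range(1, 11)]
excerpt = extract_around(document, hit_line=5, context=2)
```

Here `excerpt` holds lines 3 to 7 of the ten-line document; a hit near either end of the file simply yields a shorter excerpt.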

2 - HTML Search

A search of an HTML file is a little more complex. As might be expected, only text presented as part of the document is searched; markup text is ignored. That is, all text not part of an HTML tag construct is extracted and searched. For example, out of the following HTML fragment:

  <!-- an example HTML document -->
  <P>
  The document entitled <A HREF="example.html">"Example Document"</A>
  provides only an <I>overview</I> of the full capabilities of HTML.

only the following text would actually be searched:

  The document entitled "Example Document" provides only an overview
  of the full capabilities of HTML.
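
One way to picture this extraction is a single pass that discards everything between "<" and ">". The sketch below is an assumption about the general approach, not the server's parser; the function name is hypothetical, and details such as character entities are ignored.

```python
def visible_text(html):
    """Return the text outside <...> constructs, with whitespace collapsed."""
    out = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True       # entering a tag (or comment)
        elif ch == ">":
            in_tag = False      # leaving it
        elif not in_tag:
            out.append(ch)      # document text is kept
    return " ".join("".join(out).split())


fragment = """<!-- an example HTML document -->
<P>
The document entitled <A HREF="example.html">"Example Document"</A>
provides only an <I>overview</I> of the full capabilities of HTML.
"""
text = visible_text(fragment)
```

Applied to the fragment above, `text` is the two lines of searchable text shown, joined into a single string.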

The mechanism for partial document retrieval available with plain-text files is not present with HTML documents. HTML files are generally holistic documents (i.e. they must be treated as a whole), with the formatting of a section often heavily dependent on the formatting of earlier sections. This makes extracting a subsection perilous without extensive syntactic analysis. On the positive side, HTML documents tend to be already divided into meaningful subdocuments (files), so a retrieved hit is naturally more or less in context.

2.1 - Unbalanced < >

Occasionally HTML search results report:

  HTML problem, unbalanced <> in
     "/hyperdata/html/html-quick/entities.html"

This indicates that, on reaching end-of-file, the search engine had not encountered matching numbers of "<" and ">" characters while parsing out the markup tags. The search results become a little suspect, because it is then uncertain where the parser was inside a markup tag and where it was in document text. The document should be checked for errors.
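
A document can be checked for this condition before it is published. The sketch below simply compares character counts, which is an assumption about what "unbalanced" means here; the function name is hypothetical and the server's own check may differ.

```python
def angle_bracket_imbalance(html):
    """Return count("<") minus count(">"); zero means balanced."""
    return html.count("<") - html.count(">")


# A literal "<" in the text (instead of the entity &lt;) upsets the balance
suspect = "<P>Entities such as < must be written as &lt;</P>"
imbalance = angle_bracket_imbalance(suspect)
```

A non-zero result, here 1, points to a stray bracket such as a literal "<" that should have been written as an entity.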

3 - Search Statistics

Appended to the page of search results are some statistics on how many files were searched, how many were not (if any were non-text or non-HTML), the number of hits, and some server system statistics. Example format:

  243 files searched (57 not) with 56 files hit, for a total of 128 hits.
  Elapsed: 00:22.29 CPU: 03.86 I/O: 613 Disk: 571 Records (lines): 27080
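
Should these figures need to be collected from a results page, the first line can be picked apart with a regular expression. This is a sketch based solely on the example line above; the pattern and field names are assumptions, and wording variations (such as the "not" count being absent) are not handled.

```python
import re

# Pattern derived from the example line only; field names are illustrative.
SUMMARY = re.compile(
    r"(?P<searched>\d+) files searched \((?P<skipped>\d+) not\) "
    r"with (?P<files_hit>\d+) files hit, for a total of (?P<hits>\d+) hits\."
)


def parse_summary(line):
    """Return the four counts as integers, or None if the line does not match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


stats = parse_summary(
    "243 files searched (57 not) with 56 files hit, for a total of 128 hits."
)
```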