About Searching


The WASD hypertext environment provides real-time searching of plain-text and HTML documents. It is a simple-string search, not a GREP-style search. It is designed to provide a simple mechanism for locating documents containing a keyword, not for document analysis. The search string may contain spaces.

Note:

Searching is a notoriously CPU and I/O intensive activity. Longer searches progressively decrease scheduling priority by one every 10 seconds, from normal to zero, helping to reduce impact on any interactive users of the server system. The search algorithm itself is efficient, but searching will take longer on a more heavily loaded system because of this mechanism.
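
The priority schedule described in the note can be sketched as follows. This is only an illustration of the arithmetic, not the server's code: the function name and the normal priority value of 4 are assumptions made for the example.

```python
def search_priority(elapsed_seconds, normal_priority=4, step_seconds=10):
    """Return the scheduling priority after elapsed_seconds of searching.

    normal_priority=4 is an assumed value for illustration; the text
    only says the priority falls "from normal to zero".
    """
    drops = int(elapsed_seconds // step_seconds)  # one drop per full 10 seconds
    return max(0, normal_priority - drops)        # never goes below zero


for seconds in (0, 9, 10, 25, 40, 120):
    print(seconds, search_priority(seconds))
```

Under these assumptions a search reaches priority zero after 40 seconds and stays there for the rest of its run.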

Only files considered plain-text or HTML will be searched. Others may be specified, or be selected by a wildcard file specification, but they will not actually have their contents searched.

1 - Plain-Text Search

A search of a plain-text file is straightforward. Each line in the file is searched for the required string. The first time it is encountered is considered a hit. The line is not searched for any further occurrences.
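
A minimal sketch of this line-by-line behaviour, for illustration only. The function name is hypothetical, and the case-insensitive matching is an assumption; the text does not say how case is handled.

```python
def search_plain_text(lines, needle):
    """Return (line_number, line) for every line that contains needle.

    Each line counts as at most one hit, however often the string
    occurs in it. Case-insensitive matching is an assumption.
    """
    wanted = needle.lower()
    hits = []
    for number, line in enumerate(lines, start=1):
        if wanted in line.lower():  # the first occurrence decides; stop looking
            hits.append((number, line))
    return hits


sample = [
    "The search string may contain spaces.",
    "Nothing relevant here.",
    "A search within a search is still one hit.",
]
hits = search_plain_text(sample, "search")
for number, line in hits:
    print(number, line)
```

The third sample line contains the string twice but is reported once, matching the one-hit-per-line rule.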

Searches of plain text files allow the subsequent selection of partial documents (i.e. the retrieval of only a number of lines around any actual hit). This allows the user to selectively extract a portion of a document, avoiding the need to explicitly scan through to the section of interest.
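
Partial retrieval of this kind might look like the sketch below. The function name and the `context` parameter are hypothetical; how many lines the server actually returns around a hit is not stated here.

```python
def extract_around(lines, hit_line, context=2):
    """Return (line_number, line) pairs within `context` lines of hit_line.

    hit_line is 1-based. The window is clipped at the start and end
    of the document.
    """
    start = max(1, hit_line - context)
    end = min(len(lines), hit_line + context)
    return [(n, lines[n - 1]) for n in range(start, end + 1)]


document = ["line %d" % n for n in range(1, 11)]
excerpt = extract_around(document, hit_line=5, context=2)
for number, line in excerpt:
    print(number, line)
```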

2 - HTML Search

A search of an HTML file is a little more complex. As might be expected, only text presented as part of the document is searched; markup text is ignored. That is, all text not part of an HTML tag construct is extracted and searched. For example, out of the following HTML fragment

  <!-- an example HTML document -->
  <P>
  The document entitled <A HREF="example.html">"Example Document"</A>
  provides only an <I>overview</I> of the full capabilities of HTML.

only the following text would actually be searched:

  The document entitled "Example Document" provides only an overview
  of the full capabilities of HTML.
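
The extraction can be approximated with a small state machine that discards everything between "<" and ">". This is a sketch under stated assumptions, not the server's parser: the function name is hypothetical, and collapsing runs of whitespace is added here only to make the result easy to compare.

```python
def extract_text(html):
    """Return the characters of html that lie outside <...> constructs."""
    text = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True       # entering a markup tag (or comment)
        elif ch == ">":
            in_tag = False      # leaving it
        elif not in_tag:
            text.append(ch)     # document text: keep it
    # Collapse whitespace runs (an assumption made for readability).
    return " ".join("".join(text).split())


fragment = '''<!-- an example HTML document -->
<P>
The document entitled <A HREF="example.html">"Example Document"</A>
provides only an <I>overview</I> of the full capabilities of HTML.'''
print(extract_text(fragment))
```

A scanner this simple would be confused by a ">" inside an attribute value or comment, which is one reason to treat it as illustrative only.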

The mechanism for partial document retrieval available with plain-text files is not present with HTML documents. HTML files are generally holistic documents (i.e. must be treated as a whole), with the formatting of current sections often very dependent on the formatting of previous sections. This makes extracting a subsection perilous without extensive syntactical analysis. On the positive side, HTML documents tend to be already divided into meaningful subdocuments (files), making retrieval of a hit naturally more-or-less within context.

2.1 - Unbalanced < >



Occasionally HTML search results report:

  HTML problem, unbalanced <> in
     "/hyperdata/html/html-quick/entities.html"

This indicates that at end-of-file the search engine had encountered one more or one fewer "<" than ">" when parsing out the markup tags. The search results become a little suspect because it is then uncertain when it was within a markup tag or within document text. The document should be investigated for errors.
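
A check of this kind could be as simple as comparing the counts of the two characters once the whole file has been read. The function names below are hypothetical, and the sketch reports any non-zero difference, which is an assumption about when the server issues the message.

```python
def angle_bracket_imbalance(html):
    """Return the count of "<" minus the count of ">" (0 means balanced)."""
    return html.count("<") - html.count(">")


def check_document(path, html):
    """Return a warning string if the brackets do not balance, else None."""
    if angle_bracket_imbalance(html) != 0:
        return 'HTML problem, unbalanced <> in "%s"' % path
    return None


print(check_document("/example/good.html", "<P>balanced</P>"))
print(check_document("/example/bad.html", "<P>a < b</P>"))
```

A literal "<" in document text that was not written as an entity is a typical cause, as in the second call.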

3 - Search Statistics

Appended to the page of search results are some statistics on how many files were searched, how many were not (if any were non-text or non-HTML), the number of hits, and some server system statistics. Example format:

243 files searched (57 not) with 56 files hit, for a total of 128 hits.
Elapsed: 00:22.29 CPU: 03.86 I/O: 6138 Disk: 571 Records (lines): 27080
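
For anyone post-processing result pages, the first statistics line can be picked apart with a regular expression. This is a sketch based only on the example format above; the names are hypothetical, and treating the "(N not)" part as optional is an assumption drawn from the "if any" wording.

```python
import re

# Pattern follows the example line; "(N not)" is assumed to be optional.
SUMMARY = re.compile(
    r"(\d+) files searched(?: \((\d+) not\))? with (\d+) files hit, "
    r"for a total of (\d+) hits\."
)


def parse_summary(line):
    """Return the counts from a summary line as a dict, or None if no match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {
        "searched": int(match.group(1)),
        "not_searched": int(match.group(2) or 0),  # absent means none skipped
        "files_hit": int(match.group(3)),
        "hits": int(match.group(4)),
    }


stats = parse_summary(
    "243 files searched (57 not) with 56 files hit, for a total of 128 hits."
)
print(stats)
```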