XML Searching

Introduction

This document describes indexing and searching XML documents in Apache Cocoon.

Indexing is the process of fetching XML documents from an Apache Cocoon instance and building an index from them. Searching is the process of querying that index once it has been built.

Decomposition of XMLSearching

The indexing process is split into crawling, fetching URL resources, and generating the index.

The searching process is split into searching and feeding the search results into the Apache Cocoon pipeline.

Crawling

The crawling process is specified by:

  1. Base URL to start crawling from
  2. Included and excluded URLs
  3. Cocoon view to use for requesting links from an XML resource

Specifying the base URL determines the protocol for fetching XML resources. The implementation accepts http: URLs, which crawl an Apache Cocoon instance deployed in a servlet engine. Alternatively, you may specify a URI such as /documents/index.html, which crawls the local Apache Cocoon instance only, whether it is servlet-deployed or running in command-line mode.
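
The role of the included and excluded URLs can be illustrated with a small sketch. It is not Cocoon's crawler code: the class CrawlFilter, the use of regular expressions, and the rule that an exclude pattern overrides an include pattern are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Cocoon): decides whether a crawler should follow a link.
class CrawlFilter {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    CrawlFilter(List<String> includeRegexes, List<String> excludeRegexes) {
        for (String regex : includeRegexes) {
            includes.add(Pattern.compile(regex));
        }
        for (String regex : excludeRegexes) {
            excludes.add(Pattern.compile(regex));
        }
    }

    private static boolean matchesAny(List<Pattern> patterns, String url) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    // Assumed rule: a URL is crawled if it matches an include pattern
    // (or no include patterns are given) and matches no exclude pattern.
    boolean accepts(String url) {
        boolean included = includes.isEmpty() || matchesAny(includes, url);
        return included && !matchesAny(excludes, url);
    }
}
```

With the include pattern .*\.html and the exclude pattern .*/apidocs/.*, this filter would accept /documents/index.html and reject both /documents/apidocs/index.html and /documents/logo.png.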

Fetching URL resource

This processing step fetches a URL resource from Apache Cocoon.

Apache Cocoon offers the feature of views. This feature is used to fetch the 'bare' content of a URL.

This processing step uses the crawling component described above to retrieve the link of an XML document. The link is then augmented with a Cocoon view name for fetching the XML resource.
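
As a minimal sketch of augmenting a link with a view name, the helper below appends the view as a request parameter. It assumes the view is selected through a request parameter named cocoon-view; the class ViewLink and the view name content are hypothetical, and the sketch does not handle URL fragments or encoding.

```java
// Hypothetical helper (not part of Cocoon): builds the URL used to request a view of a resource.
class ViewLink {
    // Assumption: a view is requested through the "cocoon-view" request parameter.
    static final String VIEW_PARAMETER = "cocoon-view";

    // Appends the view name, using "&" if the link already carries a query string.
    static String withView(String link, String viewName) {
        String separator = link.contains("?") ? "&" : "?";
        return link + separator + VIEW_PARAMETER + "=" + viewName;
    }
}
```

For example, ViewLink.withView("/documents/index.html", "content") yields /documents/index.html?cocoon-view=content.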

The Avalon component CocoonCrawler defines the interface of a crawler.

Generating index

An XML resource is fed into an indexing engine. Generating an index specifies which elements of an XML resource are indexed and how the elements are stored in the index. This processing step also specifies the physical file location of the index.

The current implementation splits up an XML resource in the following way:

  • A Lucene Analyzer is used for splitting up text.
  • Each XML element is indexed using its name as the Lucene field name.
  • Each XML attribute is indexed using its element name and the attribute name as the field name. An attribute has the field name {element-name}@{attribute-name}.

The Avalon component LuceneCocoonIndexer defines the interface of an indexer.

The Avalon component LuceneXMLIndexer defines an interface for building a Lucene index from an XML document. It uses a SAX content handler for parsing an XML document and generating Lucene fields. The current index layout is implemented by SimpleLuceneXMLIndexerImpl and LuceneIndexContentHandler.
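
The field naming scheme can be sketched with a plain SAX handler. This is not the code of LuceneIndexContentHandler: the class FieldCollector is hypothetical, it collects field names and text in a map instead of creating Lucene fields, it skips the Analyzer step, and assigning text to the innermost enclosing element is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for an index content handler: maps field names to text values.
class FieldCollector extends DefaultHandler {
    final Map<String, List<String>> fields = new LinkedHashMap<>();
    // One text buffer per open element, innermost element on top.
    private final Deque<StringBuilder> text = new ArrayDeque<>();

    private void add(String field, String value) {
        String trimmed = value.trim();
        if (!trimmed.isEmpty()) {
            fields.computeIfAbsent(field, k -> new ArrayList<>()).add(trimmed);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.push(new StringBuilder());
        // Attribute fields are named {element-name}@{attribute-name}.
        for (int i = 0; i < atts.getLength(); i++) {
            add(qName + "@" + atts.getQName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!text.isEmpty()) {
            text.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String content = text.pop().toString();
        add(qName, content);   // element field, named after the element
        add("body", content);  // assumed: all text also goes into a general text field
    }

    // Parses an XML string and returns the collected fields.
    static Map<String, List<String>> collect(String xml) {
        try {
            FieldCollector handler = new FieldCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.fields;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Feeding it an s1 element with a title attribute and a nested abstract element produces the field names s1@title, abstract, and body.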

Searching

This process uses a search engine for querying the index. The input of this process is a search query string; the output is the result returned by the search engine.

The Avalon component LuceneCocoonSearcher defines an interface for searching a Lucene index.

Feeding Search Results

This is the final step for presenting information stored in the index. The result of the search engine is fed into the Cocoon processing pipeline.

A GUI for the searching process may be developed using any Java-enabled scripting language, such as JSP or XSP. In addition, a sitemap generator component, SearchGenerator, is provided, which transforms the search result to XML and feeds it into the Cocoon processing pipeline.

Interdependencies

As both Avalon components, LuceneXMLIndexer and LuceneCocoonSearcher, may use the same Lucene index, you must make sure that both components agree on the structure of the Lucene index.

The current implementation uses the following Lucene index layout:

  • The Lucene field body is an indexed field holding the pure text of an XML document. The body field is the default field name for searching. Thus the query strings foo and body:foo are equivalent.
  • Each XML element generates a Lucene field having the same name as the XML element name. For example, to search for occurrences of Cocoon inside an XML abstract element, use the query string abstract:Cocoon.
  • Each XML attribute generates a Lucene field having the name {element-name}@{attribute-name}. For example, to search for occurrences of Cocoon inside the title attribute of an s1 element, use the query string s1@title:Cocoon.
  • The Lucene field url stores the URI of the indexed document. As all fields described above are indexed only, and no XML document is stored inside the Lucene index, this field is the only reference to the XML document resource.
  • The Lucene field uid stores a unique id used for updating the index. This field is used for checking if the XML resource is newer than the information stored in the Lucene index.
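
The default-field rule for query strings can be sketched as follows. This is a deliberately simplified illustration for single-term queries, not Lucene's query parser; the class FieldTerm is hypothetical.

```java
// Hypothetical value class (not part of Lucene or Cocoon): a single field/term pair.
class FieldTerm {
    // The default field name for searching, as described in the index layout.
    static final String DEFAULT_FIELD = "body";

    final String field;
    final String term;

    FieldTerm(String field, String term) {
        this.field = field;
        this.term = term;
    }

    // Splits "field:term" at the first colon; a query without a colon targets the default field.
    static FieldTerm parse(String query) {
        int colon = query.indexOf(':');
        if (colon < 0) {
            return new FieldTerm(DEFAULT_FIELD, query);
        }
        return new FieldTerm(query.substring(0, colon), query.substring(colon + 1));
    }
}
```

Under this rule, foo and body:foo both resolve to the field body, while s1@title:Cocoon resolves to the field s1@title with the term Cocoon.
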

Configuration

The indexing and searching Avalon components are configured in the cocoon.xconf file.

The sitemap component SearchGenerator is set up in the sitemap.xmap file.

Implementation notes

The package org.apache.cocoon.components.search holds all search-related components. The current implementation uses Jakarta Lucene as its indexing and searching engine.

SearchGenerator is a sitemap generator and is available in the package org.apache.cocoon.generation.

The package org.apache.cocoon.components.crawler holds all crawling-related sources.

WebApp Sample usage

The Cocoon sample web application has links for generating an index of the Cocoon documentation and for searching the Cocoon documentation.

The following list describes step by step how to make use of the webapp sample page:

  1. Go to the page "Search the docs".
  2. Create an index by following the link "create". Creating an index may take some time, as the implementation accesses the XML resources via the http: protocol.
  3. Next, you may query the index by following the link "XSP" or "Cocoon Generators". Typing in a query will result in a table of hits ordered by relevance.

As a result of the creation step, there should be a Lucene index in the directory index below the temporary working directory of the servlet engine.

The "XSP" link for searching shows an XSP implementation that invokes the Avalon component CocoonSearch. This approach gives fine-grained control over the searching process.

The "Cocoon Generator" link shows a pipeline defined in the sitemap that uses the SearchGenerator and transforms the XML search result to HTML. This approach tries to minimize the effort of using searching, as you only need to adapt the XSLT transformation step to your needs.

Summary

This document gives an overview of the components for using an indexing and searching engine in Cocoon. It describes the component decomposition of the Cocoon XMLSearch subsystem.

Copyright © 1999-2002 The Apache Software Foundation. All Rights Reserved.