Indexing Web Sites on the Internet (284022)



The information in this article applies to:

  • Microsoft SharePoint Portal Server 2001
  • Microsoft Office SharePoint Portal Server 2003

This article was previously published under Q284022.
IMPORTANT: This article contains information about modifying the registry. Before you modify the registry, make sure to back it up and make sure that you understand how to restore the registry if a problem occurs. For information about how to back up, restore, and edit the registry, click the following article number to view the article in the Microsoft Knowledge Base:

256986 Description of the Microsoft Windows Registry

SUMMARY

This article describes considerations that you need to take into account when you create a Web site content source for a public Web site.

MORE INFORMATION

Understanding the following considerations helps you index in the most efficient way, avoid adverse effects on the Web sites that you crawl, and troubleshoot common indexing issues.

Controlling Crawler Access with a Robots.txt File and HTML META Tags

A Web site administrator can use a Robots.txt file to indicate where robots (Web crawlers) can go on a Web site and whether to exclude specific crawlers. The file does not enforce access control by itself; instead, compliant crawlers read the rules in it and skip the restricted areas. Microsoft SharePoint Portal Server 2001 and Microsoft Office SharePoint Portal Server 2003 always look for this file when they crawl a site and obey the restrictions in it.
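For example, a site administrator might publish a Robots.txt file such as the following (a hypothetical sketch; the directory names are placeholders). This file tells all crawlers to stay out of the two listed directories:

User-agent: *
Disallow: /private/
Disallow: /cgi-bin/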

A Web site administrator can also restrict access to individual documents by using Hypertext Markup Language (HTML) META tags. These tags use the INDEX/NOINDEX and FOLLOW/NOFOLLOW values in the tag's content attribute to tell the robot whether it can include the document in the index and whether it can follow the links in the document. For example, if you do not want a document to be crawled and you do not want the links in it to be followed, you can mark the document with the following tag:

<META name="robots" content="noindex, nofollow">

SharePoint Portal Server always obeys these HTML robots exclusion rules when it crawls Web sites. Note that SharePoint Portal Server counts robots exclusions as rule exclusions, which are not visible in the gatherer log viewer by default. See the "Gatherer Log Information" section of this article for additional information about how to view the gatherer logs.

Robots.txt files can specify different restrictions for each user agent. When you crawl the Internet, change the user agent string so that it identifies your site. By default, the string for SharePoint Portal Server is:

Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 4.0 Robot) Microsoft

To add your identifier, you need to modify the registry.

WARNING: If you use Registry Editor incorrectly, you may cause serious problems that may require you to reinstall your operating system. Microsoft cannot guarantee that you can solve problems that result from using Registry Editor incorrectly. Use Registry Editor at your own risk.

To add your identifier, add or modify the UserAgent string value under the registry key that is appropriate for your version of SharePoint Portal Server:

For Microsoft SharePoint Portal Server 2001, use the following key:

HKEY_LOCAL_MACHINE\Software\Microsoft\Search\1.0\Gathering Manager

For Microsoft Office SharePoint Portal Server 2003, use the following key:

HKEY_LOCAL_MACHINE\Software\Microsoft\SPSSearch\Gathering Manager
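For example, on SharePoint Portal Server 2003 you could set the value with a .reg file such as the following (a sketch; the contact URL that is appended to the default string is a hypothetical placeholder, and you should verify the key path on your own installation before you merge the file):

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\Software\Microsoft\SPSSearch\Gathering Manager]
"UserAgent"="Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 4.0 Robot; http://www.example.com/crawler) Microsoft"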

Following Complex Links

By default, SharePoint Portal Server does not follow complex links (links that contain parameters after a question mark in the URL; for example, http://www.mysite.com/default.asp?url=/somedir/somefile.htm). If the site that you are crawling contains complex links that you want to follow, you must create a site path rule for the site:
  1. In the Management/Content Sources folder, click Additional Settings.
  2. Click Site Paths, and then click New.
  3. Type the URL for the site, make sure that you place a wildcard character at the end, and then click Include this path.
  4. Click Options, and then click Enable complex links. If the selection is unavailable (appears dimmed), make sure that you typed a properly formed URL with a wildcard character at the end in step 3 (for example, http://www.microsoft.com/*).

Crawling Password-Protected Web Sites

You cannot include credentials in the URL that you specify for a Web site content source. If you want to crawl a password-protected site, create a site path rule instead. Follow steps 1 through 3 in the "Following Complex Links" section of this article to create the site path rule. Then click the Options tab, click the Account tab, and provide the user name and password that the crawler should use.

Understanding That a Link May Reference an Excluded File Type

Each workspace maintains a file type inclusion list, and when content sources are crawled, only files of the included types are indexed. If a Web site link references an excluded file type, the link is not followed, and the exclusion is logged as a rule exclusion. For example, consider the following link:

http://www.mysite.com/Index.cfm?ArticleID=q284022

Unless .cfm is added to the file type inclusion list, the link is not followed.

Adding Proxy Server Settings

If your network uses a proxy server to access the Internet, you must provide the proxy server information so that the SharePoint Portal Server crawler can reach the Internet through it. Configure this information on the Proxy Server tab of the server's properties in the SharePoint Portal Server Administration console. The setting is used only by search; changing it does not affect any proxy settings that are configured in Microsoft Internet Explorer.

Configuring Site Hops

When you create a Web site content source, you choose to index either This page or This site. When you select This site, all links to pages within that site are followed, but links to other Web sites are not. This behavior can cause an immediate index failure if the default page of the site that you index redirects to another site (for example, if you connect to http://my_site.com and you are redirected to a default page at http://my_alternate_site.com). In this case, or if you want the crawler to follow links to other sites, you must configure a custom site hops setting. You can set site hops on the Configuration tab of the Web site content source properties.

IMPORTANT: Limit the number of site hops to the minimum necessary. When you perform an Internet crawl, you might index millions of documents in just a few site hops. If you set the number of site hops on a Web site content source to unlimited (by clicking Custom, and then clicking to clear the Limit site hops and Limit page depth check boxes), you must include a site path rule that specifically includes that content source in the index. Otherwise, the content source is automatically excluded from the index to prevent unlimited crawling. The recommended site path rule strategy when you crawl Internet sites is to create an exclusion rule for the entire HTTP URL space (http://*), and then create inclusion rules for only those sites that you want to index.
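For example, a rule set such as the following (the included site URLs are hypothetical placeholders) excludes the entire HTTP URL space and then includes only the sites that you want to index:

Exclude: http://*
Include: http://www.example.com/*
Include: http://www.wideworldimporters.com/*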

Being a Considerate Crawler

When you crawl someone else's Web site, you increase the load on that server. To avoid overloading a Web site that you are indexing, use site hit frequency rules, which specify how frequently documents are requested from a site and how many documents are requested. You configure these rules on the Load tab of the server's properties in the SharePoint Portal Server Administration console.

Gatherer Log Information

When you perform an index update, crawling activity is recorded in gatherer logs. The easiest way to view the gatherer logs is to use the gatherer log Web page viewer. To gain access to the log viewer, click the Click here for detailed log link in the Content Sources folder. By default, only error messages are displayed in the log viewer. If you want to view rule exclusions and successes for troubleshooting purposes, enable this logging on the Logging tab of the workspace properties in the SharePoint Portal Server Administration console. Do not enable these settings unless you are actively troubleshooting, because logging the additional information greatly increases the log file size.

You can also view successes and rule exclusions by using Gthrlog.vbs, a command-line utility that is located in the Support\Tools folder on the SharePoint Portal Server CD-ROM.
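For example, you might run the utility from a command prompt as follows (a sketch that assumes the utility takes the path of a gatherer log file as its only argument; check the utility's documentation for the exact syntax):

cscript Gthrlog.vbs <path to gatherer log file>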

Modification Type: Major
Last Reviewed: 10/6/2003
Keywords: kbinfo KB284022