MORE INFORMATION
Understanding the following considerations can help you perform indexing in the most
efficient way, avoid adverse effects on the Web sites that you crawl, and
troubleshoot common indexing issues.
Controlling Crawler Access with a Robots.txt File and HTML META Tags
A Web site administrator can use a Robots.txt file to indicate
where robots (Web crawlers) are permitted to go on a Web site and to exclude
specific crawlers. Well-behaved crawlers read these rules and avoid the areas that the
file restricts; the Web server itself does not enforce the rules. Microsoft SharePoint Portal Server 2001 and Microsoft Office SharePoint Portal Server 2003 always look for this file when they crawl a site and obey the restrictions in it.
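For example, a Robots.txt file that contains the following rules permits all crawlers to access the entire site except the /private folder (the folder name is used here only for illustration):
User-agent: *
Disallow: /private/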
A
Web site administrator can also restrict access to certain documents by using
Hypertext Markup Language (HTML) META tags. These tags use the INDEX/NOINDEX and
FOLLOW/NOFOLLOW values to tell the robot whether it can include the document in
the index and whether it can follow the links in the document. For example, if
you do not want the document to be indexed and you do not want links in the
document to be followed, you can mark the document with the
following tag:
<META name="robots" content="noindex, nofollow">
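Similarly, to allow a document to be indexed but to prevent the links in it from
being followed, you can mark the document with the following tag:
<META name="robots" content="index, nofollow">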
SharePoint Portal Server always obeys these HTML robots exclusion rules when it
crawls Web sites. Note that robots exclusions are counted as rule exclusions,
which are not visible in the gatherer log viewer by default. See the "Gatherer
Log Information" section of this article for additional information about how
to view the gatherer logs.
Robots.txt files specify restrictions for
each User Agent. Before you crawl the Internet, change the User Agent string so
that it identifies your site. By default, the string for SharePoint Portal Server is:
Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 4.0 Robot) Microsoft
To add your identifier, you need to modify the
registry.
WARNING: If you use Registry Editor incorrectly, you may cause serious
problems that may require you to reinstall your operating system. Microsoft
cannot guarantee that you can solve problems that result from using Registry
Editor incorrectly. Use Registry Editor at your own
risk.
To add your identifier, add the registry key that is appropriate for your version of SharePoint Portal Server:
For Microsoft SharePoint Portal Server 2001, add the following key:
HKEY_LOCAL_MACHINE\Software\Microsoft\Search\1.0\Gathering Manager\UserAgent
For Microsoft Office SharePoint Portal Server 2003, add the following key:
HKEY_LOCAL_MACHINE\Software\Microsoft\SPSSearch\Gathering Manager
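For example, the following .reg file fragment is a minimal sketch that appends a custom identifier to the default User Agent string for SharePoint Portal Server 2001. It assumes that UserAgent is a string value under the Gathering Manager key and uses "Contoso Crawler" as a placeholder identifier; substitute your own identifier, and use the SPSSearch key shown above for SharePoint Portal Server 2003:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\Software\Microsoft\Search\1.0\Gathering Manager]
"UserAgent"="Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 4.0 Robot) Microsoft Contoso Crawler"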
Following Complex Links
By default, SharePoint Portal Server does not follow complex
links (links that contain query parameters after a question mark in the URL; for
example, http://www.mysite.com/default.asp?url=/somedir/somefile.htm). If the
site that you are crawling contains complex links that you want to follow, you
must create a site path rule for the site:
1. In the Management/Content Sources folder, click Additional Settings.
2. Click Site Paths, and then click New.
3. Type the URL for the site, make sure that you place a
wildcard character at the end, and then click Include this
path.
4. Click Options, and then click Enable complex links. If the
selection is unavailable (appears dimmed), make sure that you typed a properly
formed URL with a trailing wildcard character in step 3 (for
example, http://www.microsoft.com/*).
Crawling Password-Protected Web Sites
You cannot specify credentials in the URL that you provide for a Web site
content source. If you want to crawl a password-protected site, create a site
path rule. Follow steps 1 through 3 in the "Following Complex Links" section of
this article to create the site path rule, click the Options tab, click the
Account tab, and then provide the user name and password.
Understanding That a File Type That Is Referenced in a Link May Be Excluded
Each workspace maintains a file type inclusion list, and when
content sources are indexed, only those file types are indexed. If a Web site
link references an excluded file type, the link is not followed and is logged
as a rule exclusion. One example is a link such as:
http://www.mysite.com/Index.cfm?ArticleID=q284022
Unless .cfm is added to the file type inclusion list, the link is
not followed.
Adding Proxy Server Settings
If your network uses a proxy server to access the Internet, you
must provide the proxy server information so that the SharePoint Portal Server
crawler can use it. Configure this information on the
Proxy Server tab of the server's properties in the SharePoint Portal Server
Administration console. These settings are used only by search; changing them
does not affect any proxy settings that are configured in Microsoft Internet
Explorer.
Configuring Host Hops
When you create a Web site content source, you choose to index
either This page or This site. When you select
This site, all of the links to pages within that site are followed, but
links to other Web sites are not. This can cause an immediate index failure if
you attempt to index a Web site in which the default page performs an immediate
redirect to another site (for example, if you connect to
http://my_site.com and you are redirected to a
default page at http://my_alternate_site.com). In
this case, or if you want the crawler to follow links to other sites, you must
configure a custom host hops setting. You can set host hops on the
Configuration tab of the Web site content source properties.
IMPORTANT: Limit the number of site hops to the absolute minimum number
necessary. When you perform an Internet crawl, you might index millions of
documents in just a few site hops. If you set the number of site hops on a Web
site content source to unlimited (by clicking Custom, and then clicking to
clear the Limit site hops and Limit page depth check boxes), you must include a
site path rule that specifically includes that content source in the index.
Otherwise, the content source is automatically excluded from the index to
prevent unlimited crawling. The site path rule strategy that is recommended
when you are crawling Internet sites is to create an exclusion rule for the
entire HTTP URL space (http://*), and then create inclusion rules for only
those sites that you want to index.
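For example, to limit an Internet crawl to two specific sites, you might create site path rules such as the following (the site names are for illustration only):
Exclude: http://*
Include: http://www.example.com/*
Include: http://www.example.org/*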
Being a Considerate Crawler
When you crawl someone else's Web site, you increase the load on
that server. You can use site hit frequency rules to avoid overloading a Web
site that you are indexing. Site hit frequency rules specify how frequently
documents are requested from a site and how many documents are requested. These
rules are configured on the
Load tab of the server's properties in the SharePoint Portal Server
Administration console.
Gatherer Log Information
When you perform an index update, crawling activity is recorded
in gatherer logs. The easiest way to view the gatherer logs is to use the
gatherer log Web page viewer. To gain access to the log viewer, click the
Click here for detailed log link in the Content Sources folder. By default, only error
messages are displayed in the log viewer. If you want to view rule exclusions
and successes for troubleshooting purposes, you can enable that logging on the
Logging tab of the workspace properties in the SharePoint Portal Server
Administration console. It is recommended that you not enable these logging
settings unless you are actively troubleshooting, because logging the additional
information greatly increases the log file size.
You can also view
successes and rule exclusions by using Gthrlog.vbs, a command-line
utility that is located in the Support\Tools folder on the SharePoint Portal
Server CD-ROM.
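For example, you might run the utility with Cscript.exe by using a command line such as the following. This is only a sketch that assumes the script accepts the path of a gatherer log file as an argument (the path shown is illustrative); check the script itself for the exact syntax:
cscript Gthrlog.vbs "C:\GatherLogs\Workspace1.gthr"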