All incremental crawls against the MCMS 2002 site are performed as full crawls (832432)



The information in this article applies to:

  • Microsoft Content Management Server 2002 SP1a
  • Microsoft Content Management Server 2002 SP1
  • Microsoft Content Management Server 2002
  • Microsoft Office SharePoint Portal Server 2003
  • Microsoft SharePoint Portal Server 2001 SP2

SYMPTOMS

When you use Microsoft SharePoint Portal Server 2001 or Microsoft Office SharePoint Portal Server 2003 as the search engine to build a search result catalog against a Microsoft Content Management Server (MCMS) 2002 Web site, every incremental update of that catalog (that is, every SharePoint Portal Server incremental crawl) against the MCMS 2002 site is performed as a full crawl.

RESOLUTION

To resolve this issue, add code to your MCMS 2002 page templates so that SharePoint Portal Server receives the Last-Modified date and time stamp and the Microsoft Internet Information Services (IIS) response code that SharePoint Portal Server must have to determine whether the posting must be re-catalogued. To do this, you must remove the output cache directive in the MCMS 2002 template code. The output cache directive is typically declared at the beginning of the MCMS 2002 template code-behind file (this file is Aspx.cs or Aspx.vb). After you remove the output cache directive from the template, you can still use downlevel caching with the sample code that this article contains.

The code first retrieves the If-Modified-Since HTTP header value from the conditional HTTP GET request. The code then obtains the last modified value of the posting, compares the two date and time stamps, and returns the corresponding IIS return status code to the client. The last part of the code lets your site continue to use the output cache even though the output cache directive has been removed from the template.

Sample Code
//Declare the variables that you need.
System.DateTime LastModifiedTime, MyModifiedTime, IncrementalIndexTime;
System.String MyString;
bool Return304 = false;

//Get the last modified time for the current MCMS posting.
LastModifiedTime = CmsHttpContext.Current.Posting.LastModifiedDate;
//Converting the time format for comparison
MyModifiedTime = CmsHttpContext.Current.Posting.LastModifiedDate.ToUniversalTime();
//Retrieve the If-Modified-Since HTTP header value from the HTTP GET request.
MyString = HttpContext.Current.Request.Headers.Get("If-Modified-Since");

//Check to see if it is a conditional HTTP GET.
if (MyString != null)
{
 //This is a conditional HTTP GET request. Compare the strings.
 try
 {
  IncrementalIndexTime = Convert.ToDateTime(MyString).ToUniversalTime();
  
  if(IncrementalIndexTime.ToString() == CmsHttpContext.Current.Posting.LastModifiedDate.ToString())
  {
   Return304 = true;
  }
 }
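 catch
 {
  //The header value could not be parsed as a date.
  //Treat the request as an unconditional HTTP GET and return the full page.
 }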
}
if(Return304 == true)
{
 Response.StatusCode = 304;
 Response.End();
}

if(CmsHttpContext.Current.Mode==Microsoft.ContentManagement.Publishing.PublishingMode.Published)
{
 //This is the code that causes ASP.NET to send the header.
 Response.Cache.SetLastModified(CmsHttpContext.Current.Posting.LastModifiedDate.ToLocalTime());
 //The following lines enable downlevel caching in proxy servers or browser cache.
 Response.Cache.SetCacheability(System.Web.HttpCacheability.Public);
 //Set the expiration time for the downlevel cache (5 minutes is used in this sample).
 Response.Cache.SetExpires(System.DateTime.Now.AddMinutes(5));
 Response.Cache.SetValidUntilExpires(true);
}

MORE INFORMATION

A SharePoint Portal Server incremental crawl relies on two factors that the IIS server returns:
  • A response status code of either 304 (Not Modified) or 200 (OK) to the conditional HTTP GET request from SharePoint Portal Server.
  • A Last-Modified date and time stamp for the posting. The Last-Modified date and time stamp is found in the Last-Modified HTTP header.
When SharePoint Portal Server starts an incremental crawl, SharePoint Portal Server sends HTTP GET requests to all the postings on the Web site. If a record shows that the posting has been previously catalogued, SharePoint Portal Server sends a conditional HTTP GET request. A conditional HTTP GET request is an HTTP GET request that includes the If-Modified-Since HTTP header. The If-Modified-Since date and time stamp is the Last-Modified date and time stamp value that was received from IIS when the posting was last catalogued. IIS compares the If-Modified-Since value with the last modified date and time of the requested resource. If the last modified date is earlier than or equal to the If-Modified-Since value, IIS returns a status code of 304, and SharePoint Portal Server skips the posting. If the last modified date is later than the If-Modified-Since value, IIS returns a status code of 200, and SharePoint Portal Server re-indexes the posting.
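
The comparison rule described above can be sketched as a small, stand-alone C# helper. This is an illustration only, not code from IIS, MCMS, or SharePoint Portal Server. The class name ConditionalGet and the method name StatusFor are hypothetical, and the sketch assumes .NET Framework 2.0 or later (for DateTime.TryParse and DateTimeKind) and a last modified value that is already expressed in UTC. Because HTTP date headers carry whole seconds only, the sketch truncates the last modified value to the second before it applies the "earlier than or equal to" test.

```csharp
using System;
using System.Globalization;

// Hypothetical helper that illustrates the 304/200 decision for a conditional HTTP GET.
public static class ConditionalGet
{
    // Returns 304 when the resource has not changed since the If-Modified-Since
    // time stamp; otherwise returns 200. lastModifiedUtc is assumed to be in UTC.
    public static int StatusFor(string ifModifiedSince, DateTime lastModifiedUtc)
    {
        // No header: this is an unconditional HTTP GET, so send the full page.
        if (string.IsNullOrEmpty(ifModifiedSince))
        {
            return 200;
        }

        // Parse the HTTP date (for example "Wed, 21 Oct 2015 07:28:00 GMT") as UTC.
        DateTime since;
        bool parsed = DateTime.TryParse(
            ifModifiedSince,
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal,
            out since);

        // An unreadable header is treated the same as a missing header.
        if (!parsed)
        {
            return 200;
        }

        // HTTP dates have one-second resolution, so drop the sub-second part
        // of the last modified value before comparing.
        long wholeSecondTicks =
            lastModifiedUtc.Ticks - (lastModifiedUtc.Ticks % TimeSpan.TicksPerSecond);
        DateTime truncated = new DateTime(wholeSecondTicks, DateTimeKind.Utc);

        // Earlier than or equal to the header value: not modified.
        return truncated <= since ? 304 : 200;
    }
}
```

In a page template, the returned value could be used to decide whether to set Response.StatusCode to 304 and end the response. The posting's last modified value would first have to be converted to UTC if it is not already in UTC.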

By design, a request to an MCMS 2002 posting always yields an IIS return status of 200 because MCMS 2002 postings are generated on the fly, and there is no physical file that IIS can use to compare the last modified date and time value. Because of this by-design behavior, SharePoint Portal Server never receives a 304 status code from an MCMS 2002 site, and every incremental crawl re-indexes the whole site. This may be very time-consuming on large sites. This behavior has not been confirmed on search engines other than SharePoint Portal Server. However, it may also occur on other search engines that rely on the IIS return status code and the Last-Modified HTTP header value to perform incremental indexing on a Web site. If it does, you can use the solution that this article describes to resolve the issue.

When you perform a search against an MCMS 2002 Web site, you may also want to make sure that you are not using Microsoft Office Thicket files as resources or attachments on the MCMS 2002 postings. For more information, click the following article number to view the article in the Microsoft Knowledge Base:

830718 Indexing takes a long time when an HTML resource exists in MCMS


Modification Type: Major
Last Reviewed: 5/31/2005
Keywords: kbhowto KB832432