com.sun.portal.providers.urlscraper
Class URLScraperProvider

java.lang.Object
  extended by com.sun.portal.providers.ProviderAdapter
      extended by com.sun.portal.providers.ProfileProviderAdapter
          extended by com.sun.portal.providers.urlscraper.URLScraperProvider
All Implemented Interfaces:
Provider, ProviderEditTypes, ProviderWidths
Direct Known Subclasses:
XMLProvider

public class URLScraperProvider
extends ProfileProviderAdapter

A URLScraperProvider is a content provider that can retrieve and display content from a given URL.

URLScraperProvider acts as an HTTP client and makes a request for the content of the specified URL and then displays it in the channel.

Forwarding of cookies
Each URLScraper channel has a cookiesToForwardList attribute that can be set in the display profile. If a cookie is allowed by this attribute, that cookie in the request coming from the browser is forwarded to the web server specified by the URL. The allCookies attribute can be set to true to forward all cookies. A set-cookie request from that web server is sent back to the browser, modified so that the cookie is only sent back to the portal server.

URL Rewriting
The content gathered by the channel will be rewritten if the rewriter is available. The ruleset used by the rewriter can be specified in the display profile attribute rulesetID. Relative URLs are converted to absolute URLs. For example, if your portal server is http://portal.iplanet.com/ and the web server specified in the URL is http://foo.sesta.com/ and the file contains

<IMG SRC="/images/blah.gif">

then the content sent back to the browser via the portal server will be rewritten as:

<IMG SRC="http://foo.sesta.com/images/blah.gif">

Otherwise, the browser would attempt to read the image from http://portal.iplanet.com/images/blah.gif and would fail to resolve it.
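
The relative-to-absolute conversion described above can be illustrated with the standard java.net.URI class. This is a minimal sketch only: the class and method names (RelativeUrlSketch, toAbsolute) are hypothetical, and the actual rewriter is driven by a ruleset and may behave differently.

```java
import java.net.URI;

// Hypothetical illustration; not part of the portal API.
final class RelativeUrlSketch {

    // Resolves a reference found in scraped content against the URL
    // of the page that was scraped.
    static String toAbsolute(String baseUrl, String reference) {
        return URI.create(baseUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Prints http://foo.sesta.com/images/blah.gif
        System.out.println(toAbsolute("http://foo.sesta.com/", "/images/blah.gif"));
    }
}
```

References that are already absolute resolve to themselves, so they pass through unchanged.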

SSL protected pages
In general the URLScraperProvider will work with SSL pages. The important thing to remember is that there can be no level of interaction required by the specified URL as there is no way to pass that information to the end user.

Timeouts
There are 2 timeout values to consider:

Each URLScraper channel has its own timeout attribute. The channel will wait up to its individual timeout to receive content.

Encoding
The encoding is determined in the following order:
1. HTTP header, if available (only applies to http(s) URLs)
2. inputEncoding property, if non-blank
3. Tag in content, e.g. the meta tag in HTML and WML or the XML header in XML, if available (only applies to HTML, XML, and WML, as determined by the MIMEType)
4. System default
MIMEType is determined from the jvm table. If not set, it is determined from the file extension.
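
The precedence above amounts to a simple fallback chain. The following is a sketch under assumptions: the class name, method name, and parameters are hypothetical, and each candidate charset is assumed to have been extracted already by the caller.

```java
import java.nio.charset.Charset;

// Hypothetical illustration of the documented precedence; not the provider's source.
final class EncodingOrderSketch {

    // Each argument may be null or blank when that source yields no charset.
    static String chooseEncoding(String headerCharset,
                                 String inputEncoding,
                                 String contentCharset) {
        if (isSet(headerCharset)) {
            return headerCharset;      // 1. HTTP header
        }
        if (isSet(inputEncoding)) {
            return inputEncoding;      // 2. inputEncoding property
        }
        if (isSet(contentCharset)) {
            return contentCharset;     // 3. tag in content
        }
        return Charset.defaultCharset().name(); // 4. system default
    }

    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }
}
```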

Proxy Configuration
A URLScraper channel uses a proxy to scrape the specified URL if the proxy is set in the web server's jvm12.conf file. For example, the proxy can be set as:
http.proxyHost=
http.proxyPort=
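
As an illustration of how these two properties are typically consumed, the sketch below builds a java.net.Proxy from a Properties object. The class and method names are hypothetical, and this is not how the provider itself is known to read the settings. The fallback to port 80 follows the documented default for http.proxyPort in the standard Java networking properties.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Properties;

// Hypothetical illustration; the provider's own proxy handling is not shown in this page.
final class ProxyConfigSketch {

    static Proxy fromProperties(Properties props) {
        String host = props.getProperty("http.proxyHost");
        String port = props.getProperty("http.proxyPort");
        if (host == null || host.trim().isEmpty()) {
            return Proxy.NO_PROXY; // no proxy configured, connect directly
        }
        int portNumber = (port == null || port.trim().isEmpty())
                ? 80
                : Integer.parseInt(port.trim());
        // createUnresolved avoids a DNS lookup while building the address
        return new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(host.trim(), portNumber));
    }
}
```

Calling fromProperties(System.getProperties()) would pick up values passed to the JVM as system properties.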

The refreshTime attribute is used for caching and will cause the URL not to be fetched again if the page is reloaded within that time.
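
The caching behaviour described for refreshTime can be pictured as a time-bounded cache. The sketch below is an assumption-laden illustration: the class name is hypothetical, the interval is expressed in milliseconds purely for the example (the unit of refreshTime is not stated here), and the portal's real cache is not shown in this page.

```java
import java.util.function.Supplier;

// Hypothetical illustration of refresh-interval caching.
final class RefreshCacheSketch {
    private final long refreshMillis;
    private String cached;
    private long fetchedAt;
    private boolean hasValue;

    RefreshCacheSketch(long refreshMillis) {
        this.refreshMillis = refreshMillis;
    }

    // Returns the cached content unless the refresh interval has elapsed,
    // in which case the fetcher is invoked again.
    synchronized String get(long nowMillis, Supplier<String> fetcher) {
        if (!hasValue || nowMillis - fetchedAt >= refreshMillis) {
            cached = fetcher.get();
            fetchedAt = nowMillis;
            hasValue = true;
        }
        return cached;
    }
}
```

Passing the current time as an argument keeps the sketch deterministic; a caller would normally supply System.currentTimeMillis().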

NOTE: getEdit() and processEdit() methods are not implemented in URLScraper.


Field Summary
protected static java.lang.String[][] typeTable
          Array of File extensions mapped to the MIMETypes
 
Fields inherited from interface com.sun.portal.providers.ProviderWidths
WIDTH_FULL_BOTTOM, WIDTH_FULL_TOP, WIDTH_THICK, WIDTH_THIN
 
Fields inherited from interface com.sun.portal.providers.ProviderEditTypes
EDIT_COMPLETE, EDIT_SUBSET
 
Constructor Summary
URLScraperProvider()
          Default constructor.
 
Method Summary
protected  boolean forward(java.lang.String cookieName, boolean allCookies, java.util.List cookiesToForwardList)
           Returns true if the allCookies property is true; otherwise returns true only if the cookie name exists in the cookiesToForwardList.
 java.lang.StringBuffer getContent(javax.servlet.http.HttpServletRequest req, javax.servlet.http.HttpServletResponse res)
          Get the provider's content by retrieving content from specified URL.
protected  java.lang.String getContentEncoding(java.lang.String contentType, byte[] bytes, java.lang.String MIMEType)
          Gets the charset
protected  java.lang.String getContentEncodingFromContentBytes(byte[] contentBytes)
          Gets the charset from content
protected  java.io.File getFile(java.lang.String pathname)
          This method is called by getContent() if the url returned by getURL() is a file url.
protected  java.lang.StringBuffer getFileAsBuffer(java.lang.String pathName)
          Gets the specified file as StringBuffer
protected  java.lang.StringBuffer getHttpContent(javax.servlet.http.HttpServletRequest req, javax.servlet.http.HttpServletResponse res, java.lang.String url)
          Get the provider's content by retrieving content from the specified http or https URL.
 java.lang.String getInputEncoding()
           Gets the inputEncoding to be used by content.
protected  java.lang.String getRuleSetID()
           Gets the urlScraperRulesetID to be used by rewriter.
protected  int getTimeout()
          Gets the timeout property for the provider.
protected  java.lang.String getURL()
           Gets the url property for the provider.
 boolean isPresentable(javax.servlet.http.HttpServletRequest request)
          Determines presentability for channels based on this provider.
 
Methods inherited from class com.sun.portal.providers.ProfileProviderAdapter
existsBooleanProperty, existsIntegerProperty, existsListProperty, existsListProperty, existsStringProperty, existsStringProperty, getBooleanProperty, getBooleanProperty, getBooleanProperty, getBooleanProperty, getClientProperty, getIntegerProperty, getIntegerProperty, getIntegerProperty, getIntegerProperty, getListProperty, getListProperty, getMapProperty, getMapProperty, getMapProperty, getMapProperty, getMapProperty, getMapProperty, getStringAttribute, getStringProperty, getStringProperty, getStringProperty, getStringProperty, getStringProperty, getStringProperty, getTemplate, getTemplate, getTemplatePath, isAllowed, setBooleanProperty, setClientProperty, setIntegerProperty, setListProperty, setMapProperty, setStringAttribute, setStringProperty
 
Methods inherited from class com.sun.portal.providers.ProviderAdapter
getContent, getDescription, getEdit, getEdit, getEditType, getHelp, getHelp, getName, getProviderContext, getRefreshTime, getResourceBundle, getResourceBundle, getTitle, getWidth, init, isEditable, isPresentable, processEdit, processEdit
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

typeTable

protected static java.lang.String[][] typeTable
Array of File extensions mapped to the MIMETypes

Constructor Detail

URLScraperProvider

public URLScraperProvider()
Default constructor.

Method Detail

getTimeout

protected int getTimeout()
                  throws ProviderException
Gets the timeout property for the provider.

Returns:
timeout value
Throws:
ProviderException - if there is an error getting the timeout property.
See Also:
ProviderException

getURL

protected java.lang.String getURL()
                           throws ProviderException

Gets the url property for the provider. This is the URL from which the content is fetched.

Returns:
URL value
Throws:
ProviderException - if there is an error getting the URL property.
See Also:
ProviderException

getRuleSetID

protected java.lang.String getRuleSetID()
                                 throws ProviderException

Gets the urlScraperRulesetID to be used by rewriter.

Returns:
String value
Throws:
ProviderException - if there is an error getting the urlScraperRulesetID.
See Also:
ProviderException

forward

protected boolean forward(java.lang.String cookieName,
                          boolean allCookies,
                          java.util.List cookiesToForwardList)

Returns true if the allCookies property is true; otherwise returns true only if the cookie name exists in the cookiesToForwardList.

Parameters:
cookieName - name of the cookie being considered for forwarding
allCookies - allCookies property value from display profile
cookiesToForwardList - cookiesToForwardList property value from display profile
Returns:
boolean value
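
The documented decision rule is small enough to sketch in isolation. The class below is a hypothetical stand-alone illustration, not the provider's source; the null check on the list is an added assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch of the documented rule.
final class CookieForwardSketch {

    // Forward every cookie when allCookies is true; otherwise forward
    // only cookies whose name appears in cookiesToForwardList.
    static boolean forward(String cookieName, boolean allCookies,
                           List<String> cookiesToForwardList) {
        if (allCookies) {
            return true;
        }
        return cookiesToForwardList != null
                && cookiesToForwardList.contains(cookieName);
    }

    public static void main(String[] args) {
        List<String> allowed = Arrays.asList("JSESSIONID", "locale");
        System.out.println(forward("JSESSIONID", false, allowed)); // true
        System.out.println(forward("tracking", false, allowed));   // false
        System.out.println(forward("tracking", true, allowed));    // true
    }
}
```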

getInputEncoding

public java.lang.String getInputEncoding()
                                  throws ProviderException

Gets the inputEncoding to be used by content. This method returns the inputEncoding which would be used in encoding the scraped content.

Returns:
String value
Throws:
ProviderException - if there is an error getting the input encoding.
See Also:
ProviderException

isPresentable

public boolean isPresentable(javax.servlet.http.HttpServletRequest request)
Determines presentability for channels based on this provider. This overrides the base class's implementation to return true for all devices.

Specified by:
isPresentable in interface Provider
Overrides:
isPresentable in class ProviderAdapter
Parameters:
request - the HttpServletRequest
Returns:
boolean true for all devices
See Also:
Provider.isPresentable(javax.servlet.http.HttpServletRequest)

getContent

public java.lang.StringBuffer getContent(javax.servlet.http.HttpServletRequest req,
                                         javax.servlet.http.HttpServletResponse res)
                                  throws ProviderException

Gets the provider's content by retrieving content from the specified URL. This method internally calls getHttpContent when the URL returned from getURL() is an http or https URL. Certain exceptions thrown along the way are wrapped into an error message that is displayed as the channel content.

Specified by:
getContent in interface Provider
Overrides:
getContent in class ProviderAdapter
Parameters:
req - An HttpServletRequest that contains information related to this request for content.
res - An HttpServletResponse that allows the provider to influence the overall response for the desktop page (besides generating the content).
Returns:
Channel content
Throws:
ProviderException - if there was an error generating the content.
See Also:
ProviderException, getHttpContent(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, java.lang.String), getURL()
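
The split between http(s) URLs (handled by getHttpContent) and file URLs (handled through getFile) suggests a dispatch on the URL scheme. The sketch below shows one way such a classification could look; the class, enum, and method names are hypothetical, and how the provider treats other schemes is not stated in this page.

```java
import java.util.Locale;

// Hypothetical illustration of scheme-based dispatch.
final class SchemeDispatchSketch {

    enum Kind { HTTP, FILE, UNSUPPORTED }

    static Kind classify(String url) {
        if (url == null) {
            return Kind.UNSUPPORTED;
        }
        String lower = url.trim().toLowerCase(Locale.ROOT);
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            return Kind.HTTP;   // would be fetched over the network
        }
        if (lower.startsWith("file:")) {
            return Kind.FILE;   // would be read from the local file system
        }
        return Kind.UNSUPPORTED;
    }
}
```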

getHttpContent

protected java.lang.StringBuffer getHttpContent(javax.servlet.http.HttpServletRequest req,
                                                javax.servlet.http.HttpServletResponse res,
                                                java.lang.String url)
                                         throws java.lang.InterruptedException,
                                                java.net.MalformedURLException,
                                                ProviderException

Get the provider's content by retrieving content from the specified http or https URL.

This method does not handle file URLs. It only handles http or https URLs. The content scraped from the specified URL is rewritten, if a rewriter is available, using the ruleset returned by getRuleSetID().

This method throws exceptions for certain exceptional conditions instead of returning an error message in the returned StringBuffer.

Parameters:
req - An HttpServletRequest that contains information related to this request for content.
res - An HttpServletResponse that allows the provider to influence the overall response for the desktop page (besides generating the content).
url - http or https url string
Returns:
Scraped content
Throws:
java.lang.InterruptedException - if there is a timeout while trying to get the scraped content
java.net.MalformedURLException - if the url passed in is not a valid http or https url.
ProviderException - if there was an error generating the content
See Also:
ProviderException, getRuleSetID()

getFile

protected java.io.File getFile(java.lang.String pathname)
This method is called by getContent() if the url returned by getURL() is a file url.

Returns:
File object specified by pathname, or null if the file does not exist or cannot be read.

getFileAsBuffer

protected java.lang.StringBuffer getFileAsBuffer(java.lang.String pathName)
                                          throws java.io.IOException,
                                                 ProviderException
Gets the specified file as StringBuffer

Returns:
StringBuffer containing the data from the specified file or null if file does not exist or cannot be read.
Throws:
java.io.IOException
ProviderException - if there is an error getting the file as StringBuffer.
See Also:
ProviderException

getContentEncoding

protected java.lang.String getContentEncoding(java.lang.String contentType,
                                              byte[] bytes,
                                              java.lang.String MIMEType)
                                       throws ProviderException
Gets the charset

This method determines the charset from the contentType header if it is available (only applies to http(s) URLs), otherwise from the inputEncoding property if it is non-blank, otherwise from a tag in the content, e.g. the meta tag in HTML or WML or the XML header, if available (only applies to HTML, XML, and WML).

Parameters:
contentType - the content type header value for http(s) URLs, null otherwise
bytes - Bytes from the scraped content
MIMEType - MIMEType for the content
Returns:
String charset or null if charset cannot be determined
Throws:
ProviderException - if there is an error getting the charset
See Also:
ProviderException

getContentEncodingFromContentBytes

protected java.lang.String getContentEncodingFromContentBytes(byte[] contentBytes)
Gets the charset from content

This method determines the charset based on meta tag in content

Parameters:
contentBytes - Bytes from the scraped content
Returns:
String charset or null if charset cannot be determined
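
A charset lookup from raw content bytes could be sketched as follows. This is an assumption-based illustration, not the provider's implementation: the class and method names are hypothetical, only the meta tag form is handled, and the 4096-byte scan limit is an arbitrary choice for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of extracting a charset from a meta tag.
final class MetaCharsetSketch {

    // Matches charset=VALUE as found in, for example,
    // <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    static String charsetFromBytes(byte[] contentBytes) {
        if (contentBytes == null) {
            return null;
        }
        // ISO-8859-1 maps every byte to one char, so the ASCII tag text
        // is readable before the real charset is known.
        int length = Math.min(contentBytes.length, 4096);
        String head = new String(contentBytes, 0, length,
                StandardCharsets.ISO_8859_1);
        Matcher matcher = CHARSET.matcher(head);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

Like the documented method, the sketch returns null when no charset can be found, leaving the fallback to the caller.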