org.openpipeline.pipeline.docfilter
Class HTMLFilter

java.lang.Object
  extended by org.openpipeline.pipeline.docfilter.HTMLFilter
All Implemented Interfaces:
DocFilter

public class HTMLFilter
extends Object
implements DocFilter

Implementation of the DocFilter interface for HTML files


Constructor Summary
HTMLFilter()
           
 
Method Summary
 String getDescription()
          Return a description of this filter suitable for display in the admin interface.
 String getErrorMessage()
          Return any error message that occurs during the parse process.
 Throwable getException()
          Return any exception that occurred during parsing.
 String[] getExtensions()
          Return an array of file extensions that this filter can handle.
 ArrayList getLinks()
          Returns links found in <a href="link"> tags.
 Map getMetaNameTags()
          Returns a Map containing the name/value pairs found in meta tags in the form <meta name="name" content="value">.
 String[] getMimeTypes()
          Return an array of mimetypes that this filter can handle.
 String getName()
          Return the name of this filter.
 boolean getNextItem(Item item)
          Reads data from the input, parses it, and adds it to the specified item.
 boolean hasError()
          Returns true if the last call to getNextItem() generated an error.
 void setBaseURL(String url)
          Used to resolve relative links in the document.
 void setEncoding(String encoding)
          Set the encoding of the data in the input stream.
 void setExtensions(String[] exts)
          Set the extensions that this filter handles.
 void setInputStream(InputStream in)
          Set the input stream which contains the document to be added.
 void setMimeTypes(String[] mimeTypes)
          Set the mimetypes that this filter handles.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HTMLFilter

public HTMLFilter()
Method Detail

setInputStream

public void setInputStream(InputStream in)
Description copied from interface: DocFilter
Set the input stream which contains the document to be added.

Specified by:
setInputStream in interface DocFilter

setBaseURL

public void setBaseURL(String url)
Used to resolve relative links in the document. Not part of the standard DocFilter interface.

Parameters:
url - the URL of the page we're parsing

getNextItem

public boolean getNextItem(Item item)
Description copied from interface: DocFilter
Reads data from the input, parses it, and adds it to the specified item. This method can be called repeatedly until all items in the stream have been exhausted. For normal documents (like HTML or PDF), there will be only one item. The interface can also handle streams of items, though, such as multi-item XML files or zip files.

Specified by:
getNextItem in interface DocFilter
Returns:
true if there was data in the input stream, false if the input stream was at the end. Also returns true if there was data but parsing it generated an error.
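
The calling pattern described above can be sketched as a loop that stops when getNextItem() returns false and checks hasError() after every true result. This is a minimal sketch, not OpenPipeline code: MiniFilter, ListFilter and drain are hypothetical names, and a StringBuilder stands in for the real Item class so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class GetNextItemLoopSketch {

    /** Hypothetical stand-in for the DocFilter methods used here; not the real interface. */
    interface MiniFilter {
        boolean getNextItem(StringBuilder item); // StringBuilder replaces the real Item type
        boolean hasError();
        String getErrorMessage();
    }

    /** Hypothetical filter over a fixed list of records; a null record simulates a parse error. */
    static class ListFilter implements MiniFilter {
        private final Iterator<String> records;
        private String error;

        ListFilter(List<String> records) {
            this.records = records.iterator();
        }

        public boolean getNextItem(StringBuilder item) {
            error = null;
            if (!records.hasNext()) {
                return false; // end of input
            }
            String record = records.next();
            if (record == null) {
                error = "could not parse record";
            } else {
                item.append(record);
            }
            return true; // true even when the data generated an error
        }

        public boolean hasError() {
            return error != null;
        }

        public String getErrorMessage() {
            return error;
        }
    }

    /** Reads every item from the filter, collecting error messages separately. */
    static List<String> drain(MiniFilter filter, List<String> errors) {
        List<String> items = new ArrayList<>();
        while (true) {
            StringBuilder item = new StringBuilder();
            if (!filter.getNextItem(item)) {
                break; // input exhausted
            }
            if (filter.hasError()) {
                errors.add(filter.getErrorMessage());
                continue; // skip the failed item, keep reading
            }
            items.add(item.toString());
        }
        return items;
    }
}
```

For a normal HTML document the loop body would be expected to run once; the same loop covers multi-item inputs.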

getMetaNameTags

public Map getMetaNameTags()
Returns a Map containing the name/value pairs found in meta tags in the form <meta name="name" content="value">. Not part of the standard DocFilter interface.

Returns:
a Map with the values, if any.
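
To illustrate the shape of the returned Map, here is a minimal sketch that pulls name/content pairs out of an HTML string with a regular expression. It is an assumption for illustration only: how HTMLFilter actually parses meta tags is not documented here, and MetaNameTagsSketch and extract are hypothetical names. The pattern only handles the attribute order and double quotes shown in the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaNameTagsSketch {

    // Matches <meta name="..." content="..."> with the attributes in that order,
    // case-insensitively, with an optional trailing slash.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"([^\"]*)\"\\s+content\\s*=\\s*\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    /** Returns name/value pairs in document order; empty if there are none. */
    static Map<String, String> extract(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            tags.put(m.group(1), m.group(2));
        }
        return tags;
    }
}
```

Tags that use http-equiv instead of name are not matched by this sketch.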

getErrorMessage

public String getErrorMessage()
Description copied from interface: DocFilter
Return any error message that occurred during the parse process. If hasError() returns true, then this method should return the text of the error.

Specified by:
getErrorMessage in interface DocFilter
Returns:
the error message, or null if the parse was successful

getException

public Throwable getException()
Description copied from interface: DocFilter
Return any exception that occurred during parsing.

Specified by:
getException in interface DocFilter
Returns:
the exception

hasError

public boolean hasError()
Description copied from interface: DocFilter
Returns true if the last call to getNextItem() generated an error. Check getErrorMessage() and getException() for more details.

Specified by:
hasError in interface DocFilter
Returns:
true if there was an error

setEncoding

public void setEncoding(String encoding)
Description copied from interface: DocFilter
Set the encoding of the data in the input stream. Optional; may apply in cases where the input is plain text or HTML, but will not apply in cases where the document specifies its own encoding.

Specified by:
setEncoding in interface DocFilter
Parameters:
encoding - an encoding string, for example, "UTF-8" or "ISO-8859-1". Must be one supported by the JVM.
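
Because the encoding must be one the JVM supports, a caller may want to check the name before passing it to setEncoding(). The following is a minimal sketch using the standard java.nio.charset.Charset class; EncodingCheckSketch and isUsableEncoding are hypothetical names, and whether HTMLFilter performs a similar check internally is not stated here.

```java
import java.nio.charset.Charset;

class EncodingCheckSketch {

    /** Returns true if the JVM can decode the named encoding, false otherwise. */
    static boolean isUsableEncoding(String encoding) {
        try {
            return Charset.isSupported(encoding);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }
}
```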

getLinks

public ArrayList getLinks()
Returns links found in <a href="link"> tags. Links will be converted from relative to absolute: if a page contains a relative link like "anotherpage.htm", the full URL path will be added so it is returned as "http://mysite.com/anotherpage.htm". Only correctly-formatted URLs will be returned.

Specified by:
getLinks in interface DocFilter
Returns:
an ArrayList of links, where each link is a String. An empty ArrayList is returned if there are no links.
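
The relative-to-absolute conversion described above can be sketched with the standard java.net.URI class. This is an illustration under assumptions, not the filter's actual implementation: LinkResolverSketch and resolveLinks are hypothetical names, and dropping links that fail URI parsing is only one possible reading of "correctly-formatted".

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

class LinkResolverSketch {

    /**
     * Resolves each href against baseUrl and returns the absolute results.
     * Links that are not valid URIs are skipped.
     */
    static List<String> resolveLinks(String baseUrl, List<String> hrefs) {
        List<String> links = new ArrayList<>();
        URI base;
        try {
            base = new URI(baseUrl);
        } catch (URISyntaxException e) {
            return links; // nothing can be resolved without a valid base
        }
        for (String href : hrefs) {
            try {
                URI resolved = base.resolve(new URI(href.trim()));
                if (resolved.isAbsolute()) {
                    links.add(resolved.toString());
                }
            } catch (URISyntaxException e) {
                // Malformed link (for example, one containing spaces): skip it.
            }
        }
        return links;
    }
}
```

With the base URL "http://mysite.com/index.htm", the relative link "anotherpage.htm" resolves to "http://mysite.com/anotherpage.htm", matching the example in the description.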

getDescription

public String getDescription()
Description copied from interface: DocFilter
Return a description of this filter suitable for display in the admin interface.

Specified by:
getDescription in interface DocFilter
Returns:
a String description

getExtensions

public String[] getExtensions()
Description copied from interface: DocFilter
Return an array of file extensions that this filter can handle. The extensions should be lower case and omit the dot, for example,

{"htm", "html", "jsp", "asp"}

Specified by:
getExtensions in interface DocFilter
Returns:
a list of extensions
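
The convention above (lower case, no leading dot) can be enforced before calling setExtensions(). A minimal sketch, with ExtensionNormalizerSketch and normalize as hypothetical helper names that are not part of OpenPipeline:

```java
import java.util.Locale;

class ExtensionNormalizerSketch {

    /** Returns a copy of exts with whitespace trimmed, lower-cased, and any leading dot removed. */
    static String[] normalize(String[] exts) {
        String[] out = new String[exts.length];
        for (int i = 0; i < exts.length; i++) {
            String ext = exts[i].trim().toLowerCase(Locale.ROOT);
            out[i] = ext.startsWith(".") ? ext.substring(1) : ext;
        }
        return out;
    }
}
```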

getMimeTypes

public String[] getMimeTypes()
Description copied from interface: DocFilter
Return an array of mimetypes that this filter can handle. For example,

{"text/html", "text/plain"}

Other common mimetypes include application/pdf, application/msword, application/vnd.ms-excel, etc.

Specified by:
getMimeTypes in interface DocFilter
Returns:
a list of mimetypes
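
One way a caller might use the returned array is to test whether a document's Content-Type value is among the handled mimetypes. The sketch below is an assumption about such caller code, not documented OpenPipeline behavior; MimeTypeMatchSketch and handles are hypothetical names.

```java
class MimeTypeMatchSketch {

    /**
     * Returns true if contentType, ignoring parameters such as "; charset=UTF-8",
     * equals one of the given mimetypes (case-insensitive).
     */
    static boolean handles(String[] mimeTypes, String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String bare = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType).trim();
        for (String mimeType : mimeTypes) {
            if (mimeType.equalsIgnoreCase(bare)) {
                return true;
            }
        }
        return false;
    }
}
```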

getName

public String getName()
Description copied from interface: DocFilter
Return the name of this filter.

Specified by:
getName in interface DocFilter
Returns:
a name

setExtensions

public void setExtensions(String[] exts)
Description copied from interface: DocFilter
Set the extensions that this filter handles. This value doesn't usually change the behavior of this class; it's just stored here and used for display purposes.

Specified by:
setExtensions in interface DocFilter
Parameters:
exts - extensions this class should handle

setMimeTypes

public void setMimeTypes(String[] mimeTypes)
Description copied from interface: DocFilter
Set the mimetypes that this filter handles. This value doesn't usually change the behavior of this class; it's just stored here and used for display purposes.

Specified by:
setMimeTypes in interface DocFilter
Parameters:
mimeTypes - mimetypes this class should handle