Plugins
Companies that produce connectors, file filters, and text analytics software are working to create plugins for the OpenPipeline framework. This is a wiki page; feel free to add your own plugins below.
Connectors
Connectors crawl data sources or listen for incoming messages.
| Name | Provider | License | Description |
|---|---|---|---|
| File Scanner | Built in | Apache | A crawler for file systems |
| SQL Crawler | Built in | Apache | A crawler for SQL databases. Will crawl any database with a JDBC driver |
| Item Sender/Receiver | Built in | Apache | A connector/stage combination for sending and receiving items across a network using web services. |
| Web Crawler | Dieselpoint | Commercial | A general web crawler |
| Sharepoint Crawler | Dieselpoint | Commercial | A crawler for SharePoint 2003 and 2007 |
| Day Communique | Dieselpoint | Commercial | A crawler for Day’s Communique content management system |
| Generic JCR crawler | Dieselpoint | Commercial | A crawler for JCR repositories. Several content management systems support JCR, including Alfresco and Magnolia. |
| IMAP | Dieselpoint | Commercial | A crawler for mail servers that implement the IMAP interface. |
| Exchange | Dieselpoint | Commercial | A crawler for Microsoft Exchange. |
| Vignette | Dieselpoint | Commercial | A crawler for Vignette content management software. |
| Documentum | Raritan Technologies | Commercial | A crawler for Documentum content management software. |
| Findwise Empty Connector | Findwise | Creative Commons | This connector runs a number of stages one time. Useful for testing pipelines. |
| Findwise Web Crawler | Findwise | Creative Commons | This connector fetches web pages from the internet using HTTP. |
| Findwise Item Receiver | Findwise | Creative Commons | A SOAP web service interface to Open Pipeline. |
| Findwise Open Pipeline API | Findwise | Creative Commons | This is not a real Open Pipeline connector. It is a library API that connects to the Findwise Item Receiver connector, so that you can send Items to an Open Pipeline installation programmatically. It can be integrated into any Java program or web application of your choice. |
Stages
Stages transform an item in a pipeline in some manner.
| Name | Provider | License | Description |
|---|---|---|---|
| Attribute Remover | Built in | Apache | Removes specific attributes from an item. |
| Disk Writer | Built in | Apache | Writes items to disk, usually for debugging. |
| Flattener | Built in | Apache | Flattens an item by removing the tree-like hierarchy of attributes and promoting all attributes to the top. |
| Simple Sentence Extractor | Built in | Apache | Recognizes sentences, adds annotations for them. Uses the JDK’s built-in break iterator. |
| Simple Tokenizer | Built in | Apache | Breaks text up into tokens. |
| Open Calais Wrapper | Built in | Apache | Runs an item through Open Calais’ text analytics web service. |
| Item Sender | Findwise | Creative Commons | An Item Sender stage that sends Items to a Findwise Item Receiver connector on the same or another instance of Open Pipeline. |
| Special Token Handler | Dieselpoint | Commercial | A handler for acronyms, hyphenated words, words with apostrophes |
| Named Entity Recognizers | Dieselpoint | Commercial | Recognizes persons, places, companies, and other types of entities and adds them to the document as annotations. Makes it possible to search, navigate, and display these entities. |
| Unicode character normalizer | Dieselpoint | Commercial | Necessary for proper handling of Unicode text in a search engine |
| Language detector | Dieselpoint | Commercial | Automatically recognizes the language a document is written in and annotates the document accordingly. |
| Indexer | Dieselpoint | Commercial | Adds the document to a Dieselpoint full-text search and navigation index. |
| Concept extractor | Dieselpoint | Commercial | Recognizes “concepts” in a document and makes them available for concept-based search. |
| Email transformer | Dieselpoint | Commercial | Provides a number of transformations on email. |
| HTML handlers | Dieselpoint | Commercial | Provides a way of marking or stripping HTML markup in an item so that it does not interfere with other processing. |
| Tokenizer | Dieselpoint | Commercial | A sophisticated tokenizer for text in a variety of languages. |
| Stopword handler | Dieselpoint | Commercial | Recognizes and marks stopwords. |
| LingPipe Wrapper | Alias-i | Apache/GPL/Commercial | LingPipe is a text processing library that provides a wide variety of text analytics functions. The wrapper makes most of LingPipe’s functionality available in OpenPipeline. LingPipe can detect a wide variety of entities, including person names, company names, places, phrases, parts of speech, proteins, and other constructs. The wrapper itself is Apache licensed; the LingPipe library is dual open source/commercial licensed. |
| UIMA Wrapper | Built in | Apache | A wrapper around IBM’s UIMA text analytics code. Makes UIMA annotators available in OpenPipeline. |
| MC+A Google Search Appliance Sender | MC+A | Commercial | Transforms an Item and sends to a Google Search Appliance. |
| Findwise Google Search Appliance Sender | Findwise | Commercial | Transforms Items and sends them to a Google Search Appliance. Supports web feeds and content feeds. Also supports batching. |
| Drop Stage | Findwise | Creative Commons | Specifies rules for dropping items based on the characteristics of the fields they contain. |
| Field Re-namer | Findwise | Creative Commons | Re-names fields. |
| Derby Table Deleter | Findwise | Creative Commons | Deletes from the Derby database. |
| Field Split | Findwise | Creative Commons | Splits the specified field on the specified separator. The results are inserted as multiple fields with the name specified as the output name. |
| Field Remover | Findwise | Creative Commons | Removes fields with the specified names. |
| Content Filter Stage | Findwise | Creative Commons | Removes parts of the content in a specified field (for example, an address that appears on every page of a site). |
| RegExp Extract | Findwise | Creative Commons | Extracts content from item by using the specified Regular Expression |
| Time Stamper | Findwise | Creative Commons | Adds a field called processingtime that contains the time the Item was processed. |
| Map Transform | Findwise | Creative Commons | Maps the value of the input field to a value that is written to the output field. Fields are denoted by field name. |
| Date Normalizer | Findwise | Creative Commons | Normalizes dates. |
| RegExp Replacer | Findwise | Creative Commons | Regular expression replacement. |
| Unique Copy | Findwise | Creative Commons | Copies the value of a field only if there is no field with the target name already in the document. |
| Empty Title Fixer | Findwise | Creative Commons | Fixes missing titles. If no title is found, the contents after the last forward slash in the URL will be used as the title. URL parameters will not be included in the title. |
| Empty Node Dropper | Findwise | Creative Commons | Drops empty nodes in the Item. |
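
To make two of the stage descriptions above more concrete, here is a minimal Java sketch of the underlying logic. It is not taken from the OpenPipeline or Findwise source code, and it does not use the OpenPipeline stage API. The class name `StageSketches` and both method names are hypothetical. The first method splits text into sentences with the JDK's `BreakIterator`, as the Simple Sentence Extractor description mentions. The second derives a fallback title from a URL along the lines described for the Empty Title Fixer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of OpenPipeline.
public class StageSketches {

    // Splits text into trimmed sentences using the JDK's built-in break iterator.
    // The English locale is an assumption made for this sketch.
    public static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    // Returns the text after the last forward slash of the URL,
    // with any query parameters (everything from '?') removed first.
    public static String titleFromUrl(String url) {
        int query = url.indexOf('?');
        String path = query >= 0 ? url.substring(0, query) : url;
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(sentences("OpenPipeline crawls files. It then runs stages."));
        System.out.println(titleFromUrl("http://example.com/docs/report.html?id=7"));
    }
}
```

In a real stage, logic like this would read from and write to an item's attributes instead of plain strings. How that is done depends on the OpenPipeline API, which this sketch does not attempt to reproduce.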
Doc Filters
Doc Filters extract text and metadata from binary files.
| Name | Provider | License | Description |
|---|---|---|---|
| Microsoft Office | Dieselpoint | Commercial | Parses MS Word, Excel, and Powerpoint documents. |
| RTF | Dieselpoint | Commercial | Parses RTF documents. |
| PDF | Dieselpoint | Commercial | Parses PDFs. Extracts metadata and XMP, and extracts titles from the text of the PDF. |
| HTML | Dieselpoint | Commercial | Provides special handling of HTML and links. |
| XML | Dieselpoint | Commercial | Handles XML in a variety of formats. |
| Other | Dieselpoint | Commercial | Several other formats available. |
| HTML | Findwise | Creative Commons | Parses HTML files. Uses Apache Tika. |