Plugins

Companies that produce connectors, file filters, and text analytics software are working to create plugins for the OpenPipeline framework. This is a wiki page; feel free to add your own plugins below.

Connectors

Connectors crawl data sources or listen for incoming messages.

| Name | Provider | License | Description |
| --- | --- | --- | --- |
| File Scanner | Built in | Apache | A crawler for file systems. |
| SQL Crawler | Built in | Apache | A crawler for SQL databases. Will crawl any database with a JDBC driver. |
| Item Sender/Receiver | Built in | Apache | A connector/stage combination for sending and receiving items across a network using web services. |
| Web Crawler | Dieselpoint | Commercial | A general web crawler. |
| Sharepoint Crawler | Dieselpoint | Commercial | A crawler for SharePoint 2003 and 2007. |
| Day Communique | Dieselpoint | Commercial | A crawler for Day’s Communique content management system. |
| Generic JCR crawler | Dieselpoint | Commercial | A crawler for JCR repositories. Several content management systems support JCR, including Alfresco and Magnolia. |
| IMAP | Dieselpoint | Commercial | A crawler for mail servers that implement the IMAP interface. |
| Exchange | Dieselpoint | Commercial | A crawler for Microsoft Exchange. |
| Vignette | Dieselpoint | Commercial | A crawler for Vignette content management software. |
| Documentum | Raritan Technologies | Commercial | A crawler for Documentum content management software. |
| Findwise Empty Connector | Findwise | Creative Commons | Runs a number of stages once. Useful for testing pipelines. |
| Findwise Web Crawler | Findwise | Creative Commons | Fetches web pages from the internet using HTTP. |
| Findwise Item Receiver | Findwise | Creative Commons | A SOAP web service interface to OpenPipeline. |
| Findwise Open Pipeline API | Findwise | Creative Commons | Not a true OpenPipeline connector but a library API that connects to the Findwise Item Receiver connector. With this API you can send Items to an OpenPipeline installation programmatically. It can be integrated into any Java program or web application. |


Stages

Stages transform an item in a pipeline in some manner.
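
To make the idea of a stage concrete, here is a minimal sketch of the kind of transformation the Flattener entry in the table below describes. It is not OpenPipeline code: the item is modeled as a nested `Map`, and the class name `FlattenerSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: an item is modeled as a nested Map. This is an
// assumption for illustration, not the OpenPipeline item class or stage API.
class FlattenerSketch {

    /** Returns a new map holding every leaf attribute of the item at the top level. */
    static Map<String, Object> flatten(Map<String, Object> item) {
        Map<String, Object> flat = new LinkedHashMap<>();
        collect(item, flat);
        return flat;
    }

    private static void collect(Map<String, Object> node, Map<String, Object> out) {
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map) {
                // Descend into child attributes instead of keeping the parent node.
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                collect(child, out);
            } else {
                // Leaf attribute: promote it. On a name clash the later value wins.
                out.put(entry.getKey(), value);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> author = new LinkedHashMap<>();
        author.put("name", "Ada");
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("title", "Report");
        item.put("author", author);
        System.out.println(flatten(item)); // prints {title=Report, name=Ada}
    }
}
```

How a real stage resolves name clashes between promoted attributes is not stated on this page; the sketch simply lets the later value overwrite the earlier one.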

| Name | Provider | License | Description |
| --- | --- | --- | --- |
| Attribute Remover | Built in | Apache | Removes specific attributes from an item. |
| Disk Writer | Built in | Apache | Writes items to disk, usually for debugging. |
| Flattener | Built in | Apache | Flattens an item by removing the tree-like hierarchy of attributes and promoting all attributes to the top. |
| Simple Sentence Extractor | Built in | Apache | Recognizes sentences and adds annotations for them. Uses the JDK’s built-in break iterator. |
| Simple Tokenizer | Built in | Apache | Breaks text up into tokens. |
| Open Calais Wrapper | Built in | Apache | Runs an item through Open Calais’ text analytics web service. |
| Item Sender | Findwise | Creative Commons | An item sender stage that sends Items to a Findwise Item Receiver connector on the same or another instance of OpenPipeline. |
| Special Token Handler | Dieselpoint | Commercial | A handler for acronyms, hyphenated words, and words with apostrophes. |
| Named Entity Recognizers | Dieselpoint | Commercial | Recognizes persons, places, companies, and other types of entities, and adds them to the document as annotations. Makes it possible to search, navigate, and display these entities. |
| Unicode character normalizer | Dieselpoint | Commercial | Necessary for proper handling of Unicode text in a search engine. |
| Language detector | Dieselpoint | Commercial | Automatically recognizes the language a document is written in and annotates the document. |
| Indexer | Dieselpoint | Commercial | Adds the document to a Dieselpoint full-text search and navigation index. |
| Concept extractor | Dieselpoint | Commercial | Recognizes “concepts” in a document and makes them available for concept-based search. |
| Email transformer | Dieselpoint | Commercial | Provides a number of transformations on email. |
| HTML handlers | Dieselpoint | Commercial | Provides a way of marking or stripping HTML markup in an item so that it does not interfere with other processing. |
| Tokenizer | Dieselpoint | Commercial | A sophisticated tokenizer for text in a variety of languages. |
| Stopword handler | Dieselpoint | Commercial | Recognizes and marks stopwords. |
| LingPipe Wrapper | Alias-i | Apache/GPL/Commercial | LingPipe is a text processing library that provides a wide variety of text analytics functions. The wrapper makes most of LingPipe’s functionality available in OpenPipeline. LingPipe can detect a wide variety of entities, including person names, company names, places, phrases, parts of speech, proteins, and other constructs. The wrapper itself is Apache licensed; the LingPipe library is dual open source/commercial licensed. |
| UIMA Wrapper | Built in | Apache | A wrapper around IBM’s UIMA text analytics code. Makes UIMA annotators available in OpenPipeline. |
| MC+A Google Search Appliance Sender | MC+A | Commercial | Transforms an Item and sends it to a Google Search Appliance. |
| Findwise Google Search Appliance Sender | Findwise | Commercial | Transforms Items and sends them to a Google Search Appliance. Supports web feeds and content feeds. Also supports batching. |
| Drop Stage | Findwise | Creative Commons | Lets you specify rules for dropping items based on the characteristics of the fields they contain. |
| Field Re-namer | Findwise | Creative Commons | Renames fields. |
| Derby Table Deleter | Findwise | Creative Commons | Deletes from the Derby database. |
| Field Split | Findwise | Creative Commons | Splits the specified field on the specified separator. The result is inserted as multiple fields with the name specified as the output name. |
| Field Remover | Findwise | Creative Commons | Removes fields with the specified names. |
| Content Filter Stage | Findwise | Creative Commons | Removes parts of the content in a specified field (for example, an address that appears on every page of a site). |
| RegExp Extract | Findwise | Creative Commons | Extracts content from an item using the specified regular expression. |
| Time Stamper | Findwise | Creative Commons | Adds a field called processingtime that contains the time the Item was processed. |
| Map Transform | Findwise | Creative Commons | Maps the value of an input field to a value that is written to the output field. Fields are denoted by field name. |
| Date Normalizer | Findwise | Creative Commons | Normalizes dates. |
| RegExp Replacer | Findwise | Creative Commons | Performs regular expression replacement. |
| Unique Copy | Findwise | Creative Commons | Copies the value of a field only if there is no field with the target name already in the document. |
| Empty Title Fixer | Findwise | Creative Commons | Fixes missing titles. If no title is found, the contents after the last forward slash in the URL are used as the title. URL parameters are not included in the title. |
| Empty Node Dropper | Findwise | Creative Commons | Drops empty nodes in the Item. |
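
The rule described for the Empty Title Fixer (take what follows the last forward slash in the URL and leave out URL parameters) can be sketched as below. This is an illustration only, not the Findwise implementation: the class `TitleFromUrlSketch` and its methods are hypothetical, and treating a blank title as missing is an assumption.

```java
// Hypothetical sketch of the title fallback rule. Not the Findwise stage
// and not the OpenPipeline API.
class TitleFromUrlSketch {

    /** Returns the text after the last '/' in the URL, without the query string. */
    static String titleFromUrl(String url) {
        String trimmed = url;
        int query = trimmed.indexOf('?');
        if (query >= 0) {
            // Drop "?key=value..." so URL parameters stay out of the title.
            trimmed = trimmed.substring(0, query);
        }
        // lastIndexOf returns -1 when there is no slash, so the whole string is kept.
        int slash = trimmed.lastIndexOf('/');
        return trimmed.substring(slash + 1);
    }

    /** Keeps an existing title; falls back to the URL only when the title is missing. */
    static String fixTitle(String title, String url) {
        // Assumption: a null or blank title counts as "no title found".
        boolean missing = title == null || title.trim().isEmpty();
        return missing ? titleFromUrl(url) : title;
    }

    public static void main(String[] args) {
        String url = "http://example.com/docs/report.pdf?id=7&lang=en";
        System.out.println(fixTitle("", url));              // prints report.pdf
        System.out.println(fixTitle("Annual Report", url)); // prints Annual Report
    }
}
```

A URL that ends in a slash yields an empty string under this rule; the page does not say how the actual stage handles that case.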


Doc Filters

Doc Filters extract text and metadata from binary files.

| Name | Provider | License | Description |
| --- | --- | --- | --- |
| Microsoft Office | Dieselpoint | Commercial | Parses MS Word, Excel, and PowerPoint documents. |
| RTF | Dieselpoint | Commercial | Parses RTF documents. |
| PDF | Dieselpoint | Commercial | Parses PDFs. Extracts metadata and XMP, and extracts titles from the text of the PDF. |
| HTML | Dieselpoint | Commercial | Provides special handling of HTML and links. |
| XML | Dieselpoint | Commercial | Handles XML in a variety of formats. |
| Other | Dieselpoint | Commercial | Several other formats available. |
| HTML | Findwise | Creative Commons | Parses HTML files. Uses Apache Tika. |
