OpenPipeline Wiki: Main Page
Documentation
The system is mostly self-documenting; just run it and click through the pages in the admin app. For a PowerPoint overview, take a look at Introducing OpenPipeline.
For developers there’s a bit more:
Developers' Guide (Not very far along just yet)
Javadoc (Fairly complete)
FAQ (Just getting started. Please add to it.)
Plugins
The new Plugins page is here.
Roadmap
So where is OpenPipeline going? Here is our current thinking:
Some Short-Term Goals for OpenPipeline
Add binary content to the Item class.
The benefit is that we’ll be able to attach binary documents, such as PDFs, to the Item class, and stages will be able to operate on them, transmit them, store them, etc. We’ll be able to get the DocFilters out of the Connector classes and create a DocFilters stage. This will also make DocFilters a lot easier to configure.
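As a rough illustration of the idea, here is a minimal sketch of an item that carries raw bytes plus a stage that turns them into text. Everything in it is an assumption made for illustration: `BinaryItemSketch` and `DocFilterStageSketch` are hypothetical names, not the real OpenPipeline Item or stage classes, and only plain text is handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for an item with binary content; NOT the real OpenPipeline Item class. */
final class BinaryItemSketch {
    private final String mimeType;
    private final byte[] content;
    private String extractedText; // filled in by a filter stage, null until then

    BinaryItemSketch(String mimeType, byte[] content) {
        this.mimeType = mimeType;
        // Defensive copy so later changes to the caller's array do not affect the item.
        this.content = Arrays.copyOf(content, content.length);
    }

    String getMimeType() { return mimeType; }

    byte[] getContent() { return Arrays.copyOf(content, content.length); }

    String getExtractedText() { return extractedText; }

    void setExtractedText(String text) { this.extractedText = text; }
}

/** Hypothetical stage playing the role of a DocFilters stage: bytes in, text out. */
final class DocFilterStageSketch {
    void process(BinaryItemSketch item) {
        if ("text/plain".equals(item.getMimeType())) {
            // Assumes UTF-8; a real filter would detect the encoding.
            item.setExtractedText(new String(item.getContent(), StandardCharsets.UTF_8));
        }
        // Other types (for example application/pdf) would be handed to a real
        // document filter here; that part is out of scope for this sketch.
    }
}
```

The point of the sketch is only that once the bytes travel with the item, filtering becomes an ordinary stage instead of something each connector has to do itself.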
Give stages access to the pipeline they run in.
Add tabs to the job configuration page so you can jump directly to the page you need.
Finish the Developer’s Guide
Modify the build so we include the test cases, the POM file and build file, and the Eclipse project files.
Create a contributor agreement so third parties can contribute to core OpenPipeline code.
Create a public SVN repository and a public bug-tracking system. Both are private now.
Medium-Term Goals
Add a decent web crawler.
We’re busily working on one internally at Dieselpoint. We have one for our core Dieselpoint Search, but we’d like to release an open-source version. The key design decision now is: just how scalable do we want it to be? It’s easy to write a crawler that does a few million pages. It’s much harder to write one that does a few billion. We’re currently experimenting with the architecture for the system, particularly how to handle the very large tables of URLs that a large-scale crawl entails. We’ve looked at the way Heritrix and Nutch do it, and it’s not ideal. Probably we’ll do a small-scale crawler first, which is adequate for most purposes, and a larger-scale one later.
Longer-Term Goals
Make it possible to install plugins automatically.
The idea is that you should be able to click a link in the admin interface that points to an external site, and have the system automatically download a plugin and install it. To get this to happen we’ll have to resolve a few issues: first, a plugin may require more than just a jar file. You may also need to download a config page, some supporting jars, and other resources. The download process will have to copy these things to the right places. It would be better if each plugin got its own subdir and everything it needed went there. This means we’d have to serve config pages from a location not in the webapps directory. The other issue is that the current way of loading plugins requires a server restart. This isn’t fatal.
(Should we consider OSGi for this? Eclipse, an OSGi container, theoretically can install plugins without a restart, but as a practical matter it almost always requires one.)
(Should we consider Maven for this? Maven is really slick for handling dependencies, and we use it for building OpenPipeline, but it’s got its own issues. Plus it may not be equipped for handling anything more than jar files. Must investigate.)
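To make the one-subdirectory-per-plugin idea concrete, here is a minimal sketch that collects the jar files in a single plugin directory and builds a class loader over them. It is only an assumption about how such a layout could be read; `PluginDirSketch` and its methods are hypothetical names, and the sketch does not address config pages, downloading, or the restart question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of OpenPipeline. */
final class PluginDirSketch {

    /** Returns URLs for the *.jar files directly inside one plugin's directory. */
    static List<URL> jarUrls(Path pluginDir) {
        List<URL> urls = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(pluginDir, "*.jar")) {
            for (Path jar : jars) {
                urls.add(jar.toUri().toURL());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    /** Builds a class loader that sees only this plugin's jars plus the host application's classes. */
    static URLClassLoader loaderFor(Path pluginDir) {
        List<URL> urls = jarUrls(pluginDir);
        return new URLClassLoader(urls.toArray(new URL[0]),
                                  PluginDirSketch.class.getClassLoader());
    }
}
```

Other resources in the same directory, such as a config page, are simply ignored by the jar scan and would need their own handling.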
Get OpenPipeline to run in a Hadoop-like manner.
Ideally you would be able to create a pipeline, tell the system how many remote systems to run it on, and turn it loose. This requires quite a bit of coordination and control. We have a means of doing this with our proprietary Dieselpoint Search product, and need to decide how to handle it here.
Really nice to know about your roadmap, both short term and long term. This is very useful to us.
One question though: do you plan to release all the stuff in the short-term goals at the same time, or in any particular order?
I will also encourage my colleagues to add to the FAQ.
All of the following comments are made with the aim of helping foster usage of OpenPipeline (OPL) among early adopters who can help move the project along. :)
For your medium term goal, “Add a decent web crawler”, I propose you initially release a very simplistic, bare-bones, crawler for now (named SimpleCrawler?) to more easily let folks who are interested in exploring the features of the OpenPipeline platform get their feet wet. I’d then release a more advanced version (free) and a deluxe version (pay) later on. I’d look at SimpleCrawler as a reference implementation of how to integrate with a crawler API to behave as a connector.
I’d also make this a short-term goal instead, to help with early adoption of OPL. When I downloaded OPL for the first time, the very first thing I tried to do was crawl my personal website (because I know that content) in order to “kick the tires”. Sadly, that doesn’t work yet, and not everyone who runs into trouble so early will spend the time I did trying to get the filescanner to work properly. (I succeeded with filescanner after several false starts, first trying to scan PDFs in my docs dir and then settling for scanning HTML files in my docs dir.) A simple crawler would allow someone to download OPL, point it at one or more websites, point it at a pipeline with reasonable default stages, and go. They’ll be able to do this and focus on their own custom stages while you are developing more advanced connectors and processors. OPL is already close to that right now, sans a working crawler. For someone completely new to OPL (and maybe IR), not only is an advanced, scalable crawler unnecessary, I believe most folks would find a plethora of config options overwhelming.
I would tackle this problem by leveraging an existing open-source Java crawler/API (http://java-source.net/open-source/crawlers) and modifying it to behave as an OPL connector. I’d probably use a crawler API instead of a stand-alone crawler (hopefully avoiding more dependencies/setup to manage for OPL), so maybe JSpider or Arachnid would fit the bill, with the wrapper code needed to function as an OPL connector.
I actually plan to blog about writing a bare-bones crawler from scratch in Python eventually, so I’ll share my list of items I consider to be bare-bones crawler requirements, given the typical crawler usage I’ve witnessed over the years. (Bare-bones doesn’t include an inherently scalable architecture like the one we use in our commercial offerings.)
I want a bare-bones crawler to be able to support the following features:
1. Accept list of website URLs (seed urls, usually homepage)
2. Fetch documents & discover new links (stay onsite for bare-bones, don’t follow offsite links)
3. Adhere to robots.txt site directives (and allow user to specify custom user-agent)
4. Configurable fetch rate limit of site documents (fetch documents from website at a polite speed, like 10 a minute)
5. Configurable limit on the number of pages or link levels crawled per site (maybe I only want 250 pages max from a website, which also guards against honeypots)
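Several of the bare-bones items above (seed URLs, staying on-site, a polite fetch rate, and a page limit) can be sketched as a small URL frontier. This is only an illustration under stated assumptions: `SimpleFrontier` is a hypothetical class, it handles a single seed, and the actual fetching, link extraction, and robots.txt handling are left out.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical single-site URL frontier; fetching and robots.txt are not covered. */
final class SimpleFrontier {
    private final String host;          // only links on this host are accepted
    private final int maxPages;         // hard cap on pages handed out per crawl
    private final long delayMillis;     // pause the caller should take between fetches
    private final Queue<URI> queue = new ArrayDeque<>();
    private final Set<URI> seen = new HashSet<>();
    private int handedOut = 0;

    SimpleFrontier(URI seed, int maxPages, int fetchesPerMinute) {
        if (seed.getHost() == null || fetchesPerMinute <= 0) {
            throw new IllegalArgumentException("need an absolute seed URL and a positive rate");
        }
        this.host = seed.getHost();
        this.maxPages = maxPages;
        this.delayMillis = 60_000L / fetchesPerMinute; // 10 per minute -> 6000 ms
        offer(seed);
    }

    /** Queues a discovered link if it is on-site and not seen before. */
    boolean offer(URI link) {
        URI normalized = link.normalize();
        if (normalized.getHost() == null || !host.equalsIgnoreCase(normalized.getHost())) {
            return false; // off-site or relative link: ignored in the bare-bones version
        }
        if (!seen.add(normalized)) {
            return false; // already queued or fetched
        }
        queue.add(normalized);
        return true;
    }

    /** Returns the next URL to fetch, or null when the queue is empty or the page limit is reached. */
    URI next() {
        if (handedOut >= maxPages) {
            return null;
        }
        URI url = queue.poll();
        if (url != null) {
            handedOut++;
        }
        return url;
    }

    long delayMillis() { return delayMillis; }
}
```

A crawl loop would call `next()`, fetch the page, sleep for `delayMillis()`, and pass each discovered link to `offer()`. The page cap is what keeps a honeypot from trapping the crawl.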
Some more advanced features later implemented would include:
6. Crawl sites concurrently (have thread pool with N threads (128?), give a seed url to each worker thread and queue until done)
7. Cache crawled documents (so cached docs could be re-fed/reprocessed when the pipeline is changed, without needing a recrawl)
8. Support refresh-crawls (scan for changed documents and only download if the HTTP header says it’s been modified since the last fetch [big speed up for recrawls!])
9. Support compressed content (some Apache servers will send a compressed HTTP doc if asked; faster downloads, especially for large docs)
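Items 8 and 9 both come down to standard HTTP headers: `If-Modified-Since` lets the server answer `304 Not Modified` instead of resending the document, and `Accept-Encoding: gzip` invites a compressed body, which the server marks with `Content-Encoding: gzip`. The sketch below shows just that header logic, with no networking. `RefreshCrawlSketch` and its method names are hypothetical, chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper for refresh crawls; not part of OpenPipeline. */
final class RefreshCrawlSketch {
    // HTTP dates use a fixed English format in GMT.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                             .withZone(ZoneOffset.UTC);

    /** Headers for a fetch; lastFetch is null on the first crawl of a URL. */
    static Map<String, String> requestHeaders(Instant lastFetch) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept-Encoding", "gzip");
        if (lastFetch != null) {
            headers.put("If-Modified-Since", HTTP_DATE.format(lastFetch));
        }
        return headers;
    }

    /** True when the server reports the document has not changed (304 Not Modified). */
    static boolean isUnchanged(int statusCode) {
        return statusCode == 304;
    }

    /** Decompresses the body if the response's Content-Encoding is gzip; otherwise returns it as-is. */
    static byte[] decodeBody(byte[] body, String contentEncoding) {
        if (!"gzip".equalsIgnoreCase(contentEncoding)) {
            return body;
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On a 304 response the crawler can skip the download and reuse the cached copy from item 7, which is where the recrawl speed-up comes from.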
I know that developing a robust, flexible, and scalable production-quality crawler is a major time sink, and there are many other ways that development time can be spent on this project. If the project releases “the simplest thing that can actually work” as a reference crawler, OPL users will have enough to play with to keep them busy for a while as the OPL team shifts focus to more advanced features, perhaps even the AdvancedCrawler.
-Michael
Yes, we’re definitely going to release a simple crawler early, and figure out the large-scale crawler later. The large-scale crawler will probably be integrated somehow with whatever scheme we use for making things Hadoop-like.
Yes, I agree: a simple crawler that not only parses and downloads HTML pages but also lets you specify which other file types (PDF, DOC, XLS, and so on) to download, and can process them through a stage/filter or a secondary connector/plug-in with its own staging process.