Analysis pipelines ease the exploration of Big Data by removing the need for a human to manually initialize every step of the process. Pipelines are especially useful in combination with large computing clusters, allowing analyses to run for days or weeks without constant human monitoring and interaction. As analyses scale to terabytes of data, reuse also becomes a concern: a pipeline must be able to examine previously computed data and determine what, if anything, can be reused rather than recomputed.
Our project, Orchard, generates and wraps pipelines so that data scientists can easily set up and run a complex workflow, and it provides a management system for any branching or data reuse that results from module parameter changes. In addition, Orchard provides a level of abstraction between the pipeline modules and the user through link and configuration files, allowing future module changes to be made with minimal impact on users.
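The link-and-configuration abstraction can be illustrated with a minimal sketch. The file names, field names, and module paths below are illustrative assumptions, not Orchard's actual format: the idea is only that a stable, user-facing step name is resolved to a concrete module and its parameters, so a module implementation can be swapped without changing user workflows.

```python
# Hypothetical sketch of a link file plus a configuration file.
# All names here (LINKS, CONFIG, "align", module paths) are invented
# for illustration; they do not reflect Orchard's real file format.

# Link file: user-facing step name -> concrete module implementation.
LINKS = {
    "align": "modules.aligner_v2",  # swapped in for aligner_v1 with no user-visible change
    "count": "modules.counter_v1",
}

# Configuration file: user-facing step name -> module parameters.
CONFIG = {
    "align": {"threads": 8, "reference": "hg38"},
    "count": {"min_quality": 30},
}

def resolve(step):
    """Return the (module path, parameters) pair for a user-facing step name."""
    return LINKS[step], CONFIG.get(step, {})

module, params = resolve("align")
```

Because users only ever reference the step name, replacing `modules.aligner_v2` with a newer implementation requires editing the link file alone.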
With Orchard we provide Big Data pipelines with the features present in current solutions, such as saving state to efficiently reuse previously computed outputs, while adding features those solutions lack: the ability to branch new pipeline runs off of existing runs, and the ability for less technical users to easily interact with the pipeline and its outputs.
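The reuse-versus-recompute decision described above can be sketched under a simple assumption: if a step's module and parameters are unchanged, its prior output can be reused; if a parameter changes, the step is recomputed, effectively branching a new run off the old one. The hashing scheme and function names below are illustrative assumptions, not Orchard's implementation.

```python
import hashlib
import json

def run_signature(module, params):
    """Hash a module name plus its parameters. Equal signatures indicate a
    previously computed output that can be reused instead of recomputed."""
    payload = json.dumps({"module": module, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Saved state: signature -> previously computed output (illustrative).
cache = {}

def run_step(module, params, compute):
    """Reuse a cached output when the signature matches; otherwise compute.
    Returns (output, reused_flag)."""
    sig = run_signature(module, params)
    if sig in cache:                # unchanged module + parameters: reuse
        return cache[sig], True
    result = compute(params)        # parameter change: recompute (a branch point)
    cache[sig] = result
    return result, False
```

A run with the same parameters hits the cache; changing even one parameter yields a new signature, so only the affected step and its downstream work are redone.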