Design Document
VERSION 4.2 Final
Team Yaxonomic
Evan Bolyen
Mike Deberg
Andrew Hodel
Hayden Westbrook
Faculty Mentor/Project Sponsor: Viacheslav “Slava” Fofanov, PhD
module_schematic (blueprint for the creation of new modules)
Input: (based on indicated process: init, prep, engage)
Output: (based on indicated process: init, prep, engage)
This design document is intended to familiarize the reader with the tasks that the taxonomic assignment pipeline (hereafter referred to as YAX) will be expected to accomplish. A basic computing background should be sufficient to understand the design. Some minor bioinformatic knowledge may be required, which can be gained from any source discussing read-to-reference genome alignment.
The primary function of YAX will be to manage the lengthy process of identifying genetic sequences collected from the environment. Because this process is computationally expensive, the capacity to recover from an error partway through will be of paramount importance. Recovery eliminates the need to restart the entire process from the beginning in the event of such a failure, so as little time as possible is lost in what will already be a time-consuming process.
This state-aware system will also support any user need to rerun a portion of YAX. Since YAX will have knowledge of the various states it has already created, it should be relatively trivial to append additional pieces of data to those states, for example adding reads to an existing run. This system will be entirely decoupled from the modules of YAX.
The modules of YAX will include readprep, survey/refinement, filter, aggregation, alignment, and summary. Readprep will work with the initial read input from the user. It will evaluate reads based on quality and length, chunk them if necessary, and finally collapse any duplicate reads into sets that will be used throughout the other modules. Readprep will also record some basic information that will be necessary for the useful output of YAX.
Survey and Refinement are actually the same module; the only difference is that Survey works on all references found in the National Center for Biotechnology Information's (NCBI) taxonomy database, while Refinement works on a reduced set of references produced by the Filter module based on hits found in Survey. Filter specifically receives data from Survey in the form of taxonomy identifiers (TaxIDs). These TaxIDs are used to find reference sequences that Refinement will use to build new Bowtie2 index files for alignment against the reduced set of references.
Aggregation is fundamentally the same module as Filter and will be used again after the Refinement module. Aggregation will take the output of Refinement, which is again in the form of TaxIDs. These will be filtered once more based on various user inputs, and finally a set of associated GenBank Identifiers (GIs) is compiled from the resulting set of TaxIDs. This output is the primary difference between Aggregation and Filter: Aggregation outputs a set of GIs while Filter outputs a set of TaxIDs. The TaxIDBranch tool, developed separately by Fofanov Labs, will be responsible for identifying these associated GIs and building a FASTA file which will be used in Alignment.
Alignment will utilize the tool Bowtie2 to align the initially provided reads to the GIs found in Aggregation. The intention is that the Bowtie2 call will allow the user to utilize the tool's full range of functionality and will produce information used to summarize the coverage of reads on references. Finally, Summary will collect this data into a single-document report.
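To make the intended Bowtie2 usage concrete, the following is a minimal sketch of how the Alignment module might invoke Bowtie2 as a subprocess; the function name, file names, and the pass-through of extra arguments are illustrative assumptions, not the final interface.

```python
import subprocess

def run_bowtie2(index_prefix, reads_fasta, sam_out, threads=4, extra_args=None):
    """Align FASTA-formatted reads against a pre-built Bowtie2 index.

    extra_args lets the caller pass any additional Bowtie2 options through,
    preserving access to the tool's full range of functionality.
    """
    cmd = ["bowtie2",
           "-x", index_prefix,       # prefix of the pre-built .bt2 index files
           "-f", "-U", reads_fasta,  # unpaired reads in FASTA format
           "-S", sam_out,            # SAM output, later summarized for coverage
           "--threads", str(threads)]
    if extra_args:
        cmd.extend(extra_args)       # e.g. user-supplied alignment settings
    subprocess.run(cmd, check=True)
```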
This modular design is intended to extend the useful life of YAX as much as possible by allowing modules to be replaced as new technologies are developed, with as little rework as possible. It will also aid in the development of YAX itself by allowing an iterative approach to implementing the entire system.
YAX is divided up into a few important components:
IndianaJones - Responsible for handling artifacts and passing these artifacts between the different modules of the pipeline. It creates an ExeGraph that outlines the order in which the modules should be run. It traces this graph and verifies the validity of parameters and inputs for each module. It also records all artifacts created by the pipeline in a database.
Database - Holds references to all artifacts created by YAX. Artifacts have unique ids that point to their file location. These ids are stored in the database along with the configuration parameters that the artifact depends upon.
ExeGraph - A graph representing the modules and the order in which they must be run. IndianaJones will backtrack this graph and for each ExeNode will determine if that node has an artifact that has been previously completed. When IndianaJones finds an already completed artifact, or it reaches the root of the graph, IndianaJones will begin passing artifacts to the various ExeNodes in the proper order.
ExeNode - A node in the ExeGraph. Each ExeNode acts as an interface between its module and the ExeGraph. Responsible for passing artifacts to its module. This is the only part of the State System that has any direct interaction with the modules.
Module - Each different module in the pipeline, including ReadPrep, Survey, Filter, etc., is an instance of a Module. These modules must all take in artifacts and be responsible for knowing what kinds of artifacts they need. The modules will inform the rest of the system if the artifacts given to them are valid.
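The following skeleton illustrates how these components could relate to one another; class fields and method names beyond what is described above are assumptions made for the sake of the sketch (the Module interface itself is sketched later, near the module definitions).

```python
class ExeNode:
    """Interface between one Module and the ExeGraph; the only part of the
    State System that interacts with modules directly."""
    def __init__(self, module):
        self.module = module
        self.output_artifact = None      # declared during prep, filled during engage

    def execute(self, input_artifacts):
        return self.module.run(input_artifacts, self.output_artifact)


class ExeGraph:
    """The modules and the order in which they must be run."""
    def __init__(self, ordered_nodes):
        self.nodes = ordered_nodes       # ExeNodes from root to final output


class IndianaJones:
    """Builds the ExeGraph from the arch_config, traverses it, and records
    every artifact the pipeline creates in the database."""
    def __init__(self, exe_graph, database):
        self.graph = exe_graph
        self.database = database         # artifact ids plus the parameters they depend on
```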
Users of YAX will control the pipeline through three primary commands and a configuration file. These commands are init, prep, and engage. Each of these commands will serve a different purpose in setting up and executing the architecture of YAX. Init generates the configuration file the user populates with parameters. Prep verifies that the provided parameters will indeed successfully run the pipeline in its current configuration. Engage actually runs the pipeline on the provided data and parameters.
The YAX pipeline will be described by the architectural configuration file (arch_config). IndianaJones will build an ExeGraph and validate/construct a database based on this file. The arch_config is validated against the database every time the pipeline is run. In this way the architectural configuration file is solely responsible for informing YAX of how the pipeline is configured.
Init
The Init step uses the arch_config to generate an ExeGraph which is then traversed to identify each module and their relative order. The annotations of each module are then used to create a template Config file and to generate a schema for the Database. The Database will be created in this step.
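As a rough illustration, generating the Config template might look like the sketch below; the per-module section layout and the `name`/`required_parameters` attributes are assumptions about how module annotations are exposed, not a committed format.

```python
def generate_config_template(exe_graph, config_path):
    """Walk the ExeGraph in order and write a template Config with one
    section per module, listing each required parameter with a blank value
    for the user to fill in."""
    lines = []
    for node in exe_graph.nodes:
        lines.append("[%s]" % node.module.name)
        for param in node.module.required_parameters():
            lines.append("%s = " % param)   # left blank for the user
        lines.append("")
    with open(config_path, "w") as handle:
        handle.write("\n".join(lines))
```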
Prep
Once again the arch_config is used to generate the ExeGraph, but this time the parameters of Config are validated against the corresponding Module parameters. Each parameter is responsible for defining its own validation logic, so asserting that files exist and that inputs are bounded appropriately will happen as a part of this process. Once completed, the ExeGraph is again traversed and Artifacts are declared in the ArtifactMap representing the future output of each ExeNode execution.
Engage
Same as Prep, but now the defined Artifacts in the ArtifactMap are passed by each ExeNode into its Module, as described by Traverse (below), until the final output of the ExeGraph has been achieved.
To evaluate the ExeGraph, the last ExeNode is checked for a completed output; if one is found, no further work is required. Otherwise the second-to-last ExeNode is checked, and so on, until the root is reached (and dependencies are available) or a completed output is found. The following ExeNode then passes the completed Artifact (a dependency of the current module) from the previous ExeNode, along with its own incomplete output Artifact, into its Module. The Module will write results onto the output Artifact until it is completed. If a failure occurs, the Artifact remains incomplete. This means the above process can be repeated and the same incomplete Artifact will be passed to the Module, which may then choose to start over or reuse any data already present in the incomplete Artifact.
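A minimal sketch of this evaluation, assuming a linear chain of ExeNodes and hypothetical `output_artifact`, `is_complete()`, and `dependencies()` members:

```python
def traverse(exe_graph):
    """Backtrack from the final ExeNode to the most recent completed output,
    then run forward from that point, handing each Module its completed
    upstream Artifacts and its own (possibly partial) output Artifact."""
    nodes = exe_graph.nodes                    # ordered from root to final node
    start = 0
    for i in range(len(nodes) - 1, -1, -1):    # check last node first, then backtrack
        if nodes[i].output_artifact.is_complete():
            start = i + 1                      # everything up to node i is already done
            break
    for node in nodes[start:]:
        upstream = node.dependencies()         # completed Artifacts from earlier nodes
        node.module.run(upstream, node.output_artifact)
        # If the run fails, output_artifact stays incomplete, so repeating
        # traverse() hands the same partial Artifact back to the Module,
        # which may start over or reuse the data already present.
```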
Each module is defined in terms of inputs, outputs, and a description of the functionality it provides. In these descriptions, parameters that may be shared among modules are not identified as such and will appear as parameters in both modules. Some modules fulfil multiple roles in the current version of the pipeline; this does not preclude the possibility of replacing the module in one instance and leaving it in another in the future. To be considered a module, it must be able to take an artifact as an input, explore the artifact to find the parameters required for running its functions, and output an artifact that it has populated with the correct output. Modules will have a defined interface to communicate with their ExeNode, the bridge between the actual storage of the artifact and the querying of it.
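A possible shape for that interface, under the assumption that it is expressed as a Python base class (method names are illustrative):

```python
class Module:
    """Contract a pipeline stage must satisfy to plug into the State System."""

    def required_parameters(self):
        """Parameter names (and their validation rules) this module needs;
        used by init to build the Config template and by prep to validate."""
        raise NotImplementedError

    def validate(self, artifacts, parameters):
        """Report whether the supplied artifacts and parameters are usable,
        so the rest of the system never needs module-specific knowledge."""
        raise NotImplementedError

    def run(self, input_artifacts, output_artifact):
        """Explore the input artifacts for the data it needs and populate
        output_artifact, marking it complete when finished."""
        raise NotImplementedError
```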
Input:
artifact | TYPE |
An item on which the module will work to achieve its assigned task |
parameters | VARIED |
A number of required input parameters that will be identified and added to the configuration file during the initialization phase of the State System |
Input:
parameters | VARIED |
A number of required input parameters that will be used to initialize the empty artifact |
artifact | TYPE |
An uninitialized artifact handle |
Output:
artifact | TYPE |
An initialized artifact that has a concept of completeness: its status and which of its components are incomplete |
arch_config (init) | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
config_file (prep, engage) | FILE |
Contains user input parameters for the various modules to be run |
config_file (init) | FILE |
Produces a config file containing the necessary parameters to accomplish each part of the pipeline; the user is prompted to fill in the fields with appropriate information |
database (init) if new arch_config | DB |
If an arch_config differing from the previous run is identified, a new database is constructed to represent the pipeline described in the new arch_config |
Input:
arch_config | FILE |
Description:
Receives an arch_config file which describes the current configuration of the modules. Based on this arch_config file, __init__ identifies the necessary components (db, artifact_map) that will be required for completion of the pipeline
Input:
arch_config | FILE |
Description:
Builds the ExeGraph, which is an object that has knowledge of the current pipeline and access to the various modules through their ExeNode handlers.
Input:
config_location | FILEPATH |
Description:
Based on the arch_config, modules are polled by their ExeNode handlers to retrieve their names and input parameters. Based on this information a config file is produced where users can populate the input fields. At this time an ExeGraph is created to inform the system of the pipeline’s configuration.
Input:
config | FILE |
Containing the user input parameters for each module in the pipeline |
Output:
Artifacts | DB ENTRY |
Empty artifact object registered to the artifact_map |
Description:
prep() takes in the user completed config file and verifies via ExeNode/Module communication that each parameter will indeed work during execution of the module by walking through the ExeGraph. This could require several iterations if the user has entered an invalid input. Upon successful completion of prep() artifacts are allocated.
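A sketch of that walk, assuming hypothetical `section()`, `validate_parameters()`, and `declare()` helpers on the Config, ExeNode, and ArtifactMap respectively:

```python
def prep(exe_graph, config, artifact_map):
    """Validate the user-filled Config against every Module via its ExeNode,
    then declare the future output Artifacts in the ArtifactMap."""
    errors = []
    for node in exe_graph.nodes:
        params = config.section(node.module.name)    # user-supplied values for this module
        valid, error_list = node.validate_parameters(params)
        if not valid:
            errors.extend(error_list)
    if errors:
        return errors          # the user corrects the Config and reruns prep
    for node in exe_graph.nodes:
        artifact_map.declare(node)   # empty Artifact representing the node's future output
    return []
```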
Input:
config | FILE |
Containing the user input parameters for each module in the pipeline |
Description:
After initialization and preparation have completed successfully, engage will set the various modules of the pipeline to accomplish their tasks.
Input:
exe_graph | ExeGraph |
A specific ExeGraph that will be used to create the ArtifactMap |
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
artifact_map | UUID |
The location where the first artifact associated with the input will be put |
Description:
Given an exe_graph, looks up the arch_config in order to build a map of where the artifacts required by the configuration will be placed
Input:
datapath | FILEPATH |
The path to the arch_config in the database and its resulting artifacts |
Output:
correct_path | BOOLEAN |
Tells whether that path exists and points to a step in the pipeline |
Description:
Responsible for assuring that the current version of the pipeline matches the database being used.
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline |
Output:
artifact_db | FILE |
A database of uninstantiated artifacts is created ready for use |
Description:
Creates the ArtifactMap structure and its locations in memory, but nothing is populated or actually instantiated
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
artifact_name | STRING |
The location in the arch_config where the added artifact should be placed |
Output:
artifact_uuid | UUID |
The universal unique identifier for the artifact in ArtifactMap |
Description:
Adds an artifact into ArtifactMap using arch_config to place it in the correct position
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
artifact_name | FILEPATH |
The location in the arch_config of the artifact at which the deletion should start |
Output:
artifact_uuid | UUID |
The universal unique identifier for the artifact in ArtifactMap that is located upstream of the artifact that was deleted |
Description:
Will delete the artifact that is passed into the function as well as all downstream dependencies in the ArtifactMap.
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
artifact_uuid | UUID |
The UUID of the artifact that IndianaJones wants to update |
Output:
update_success | BOOLEAN |
A boolean indicating that the update was completed successfully |
Description:
IndianaJones will use this function when a module has completed and passed a completed artifact through its ExeNode, or when a suspended module has resumed and completed an artifact that had already been started.
Input:
artifact_uuid | UUID |
The universal unique identifier for the artifact |
stream_direction | STRING |
Whether to move UP or DOWN stream |
Output:
artifact_UUID | UUID |
The universal unique identifier for the artifact in the ArtifactMap that you are searching for |
Description:
Will be used to move up or downstream to an artifact
Input:
arch_config | FILE |
The ExeGraph you want to delete is uniquely identified by the arch_config that built it |
Output:
successful_delete | BOOLEAN |
Tells whether or not the ExeGraph was deleted |
Description:
Deletes an ExeGraph based on an arch_config that will be used to find the specific graph to delete
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
exe_graph | UUID |
The universal unique identifier for the created ExeGraph
Description:
Creates the ExeGraph in memory, along with all required ExeNodes, based on the arch_config
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
remove_success | BOOLEAN |
Tells whether or not the deletion of the ExeGraph was a success |
Description:
Deletes the ExeGraph associated with the current arch_config
module | TYPE |
An ExeNode must be associated with a module. |
Input:
config_file | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
parameters_valid | BOOLEAN |
If true then error_list will be empty, if false then error_list will be populated with at least one entry |
error_list | FILE |
Contains a list of all the parameters that were given that are not acceptable in their current format if they were to be given to the proper module. |
Description:
Will probe each module, giving it the parameters that the parameter list holds for that module; the module will then return whether each is an acceptable input according to its type and range.
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
config_file | FILE |
A list of all the parameters that are required for this specific architecture |
Description:
This will supply a file that the user reviews and fills in with the required parameter values.
Input:
artifact | TYPE |
The input artifact required by the module |
Output:
artifact | TYPE |
The output artifact from the module |
Description:
This will actually run the module with its proper input and will receive back an output artifact that will be stored in ArtifactMap
Input:
artifact or exe_node | TYPE or ExeNode |
This can be supplied with an artifact which will then be looked up by its configuration file to find the ExeNode that represents it, or through an ExeNode location which will then resolve its dependencies |
direction | STRING |
Tells the function which direction of dependencies the user wants returned, DOWN or UP stream |
Output:
dependent_nodes | ARRAY |
A list of ExeNodes that are dependent on the input artifact or ExeNode |
Description:
Will explore the database in order to find the list of nodes that are dependent on the input
Input:
artifact | TYPE |
Output:
complete | BOOLEAN |
Tells the completeness of the artifact |
Description:
Called by module to decide if more work is needed to complete the artifact and keep the pipeline running
infile_list | STRING |
Contains a list of FASTQ formatted files which store the reads to be prepared |
work_dir | STRING |
Location of the working directory |
out_dir | STRING |
Location of the output directory |
num_workers | INT |
Number of processes that will work on preparing read files |
max_mem_target_gb | INT |
Memory limit to be used to avoid using swap |
trim_type | STRING |
Indicates the type of trimming to be accomplished (LCD, LCDQ, SEQ) |
segment_length (optional) | INT |
Indicates the length to trim reads to in some trim type schemes |
adapter_list (optional) | STRING |
Location of file containing adapter sequences to strip from reads |
adapter_tolerance (optional) | INT |
Number of bases in each read to check for the presence of adapter sequences |
minimum_quality | STRING |
Minimum quality threshold of a single base in the read |
min_qual_tolerance | INT |
Number of times a base in the read is allowed to miss the minimum_quality before the entire sequence is discarded |
trimmed_reads | FILE |
Containing the trimmed reads and the number of their occurrences in the provided reads |
Takes read file input in the form of FASTQ formatted files and processes those reads in the indicated fashion to create sequences that will be used as reads throughout the alignment steps.
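The duplicate-collapsing step, for instance, could be as simple as the sketch below; quality/length trimming and chunking are omitted, and the counting scheme shown is an assumption about how occurrence counts are kept.

```python
from collections import Counter

def collapse_duplicates(fastq_paths):
    """Collapse identical (already trimmed) reads into unique sequences,
    keeping the number of times each sequence occurred across the input files."""
    counts = Counter()
    for path in fastq_paths:
        with open(path) as handle:
            for line_number, line in enumerate(handle):
                if line_number % 4 == 1:        # the sequence line of each FASTQ record
                    counts[line.strip()] += 1
    return counts                               # {sequence: occurrence count}
```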
trimmed_reads | FILE |
Containing the trimmed reads and the number of their occurrences in the provided reads |
index_location | STRING |
Location of pre-computed index files |
taxid_manifest | FILE |
Containing a list of taxids to be used in alignment |
survey_results | FILE |
Containing a list of read fragments associated with the taxids of interest and sorted based on the number of mismatches per association |
Input:
taxid_manifest | FILE |
Containing a list of taxids to be used in alignment |
Description:
Taxids are retrieved from the taxid_manifest and prepared for use in the alignment process
Input:
trimmed_reads | FILE |
Containing the trimmed reads and the number of their occurrences in the provided read files |
Output:
reads | FILE |
FASTA formatted file containing entries for the reads found in the trimmed_reads |
Description:
Reads are retrieved from the trimmed_reads file and put into FASTA format, which can be easily input to various alignment tools
Input:
taxid_of_interest | STRING |
The taxid of the index to run an alignment on |
reads_path | STRING |
Path to the FASTA formatted file containing reads to align |
Output:
taxid_reads_alignment | FILE |
File produced by the alignment containing the alignment information of the reads against the taxid's index |
Description:
Accomplishes the actual alignment of read files to a single taxid index producing alignment information in file form
Input:
trimmed_reads | FILE |
index_location | STRING |
taxid_manifest | FILE |
Output:
survey_results | FILE |
Description:
For each index file a process is spawned to run the alignment of the reads against the index file or group of taxids indicated.
NOTE
Index files must be pre-computed and named after their reference taxid. This module has the potential to use multiple processes, so that each read file can be aligned to an index in its own process, potentially allowing the alignment of multiple read files to different indexes simultaneously.
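A minimal sketch of that process-per-index scheme, assuming the indexes are named by TaxID and that Bowtie2 is invoked directly; the helper names are illustrative.

```python
import os
import subprocess
from multiprocessing import Pool

def align_one(index_prefix, reads_fasta, sam_out):
    """Align the prepared reads against a single pre-computed Bowtie2 index."""
    subprocess.run(["bowtie2", "-x", index_prefix, "-f", "-U", reads_fasta,
                    "-S", sam_out], check=True)

def survey(index_prefixes, reads_fasta, out_dir, num_workers):
    """Spawn one alignment per index, up to num_workers at a time; since each
    index is named after its reference TaxID, the SAM output is named to match."""
    jobs = [(prefix, reads_fasta,
             os.path.join(out_dir, os.path.basename(prefix) + ".sam"))
            for prefix in index_prefixes]
    with Pool(num_workers) as pool:
        pool.starmap(align_one, jobs)
```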
survey_results | FILE |
Containing a list of read fragments associated with taxids and sorted in tree form based on the number of mismatches per association. |
maximum_lca_threshold | INT |
Provided by the user in the configuration file; used to determine a threshold of informativeness |
taxid_tree | FILE |
The entire taxid tree for looking up informativeness |
Input:
survey_results | FILE |
maximum_lca_threshold | INT |
Output:
informed_reads | FILE |
Contains taxids that have been deemed informative based on a least common ancestor approach |
Description:
Identifies informative read fragments using a lowest common ancestor approach on survey_results. The tree is traversed upwards at most maximum_lca_threshold times from each taxid, and if the ancestor is shared then that read is deemed informative
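A sketch of that rule for a single read, assuming the taxid tree is available as a child-to-parent mapping (the data structure is an assumption; the threshold semantics follow the description above):

```python
def is_informative(hit_taxids, parent, max_lca_threshold):
    """Return True if every TaxID this read hit reaches a shared ancestor
    within at most max_lca_threshold steps up the taxid tree."""
    if not hit_taxids:
        return False
    ancestor_sets = []
    for taxid in hit_taxids:
        seen = {taxid}
        node = taxid
        for _ in range(max_lca_threshold):
            node = parent.get(node)
            if node is None:                    # walked past the root
                break
            seen.add(node)
        ancestor_sets.append(seen)
    shared = set.intersection(*ancestor_sets)   # ancestors common to every hit
    return len(shared) > 0
```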
Input:
informed_reads | FILE |
Output:
filter_results | FILE |
FASTA format with reference sequences associated with the taxids from the union of the members of each subtree of informed_reads |
Description:
Unions the informative taxids and isolates the subtrees of the taxid tree resulting from rooting at the members of that union.
filter_results | FILE |
FASTA format of taxids that have been deemed informative |
refinement_results | FILE |
Containing a list of read fragments associated with taxids and sorted in tree form based on the number of mismatches per association |
maximum_lca_threshold | INT |
Provided by the user in the configuration file; used to determine a threshold of informativeness |
taxid_tree | FILE |
The entire taxid tree for looking up informativeness |
Input:
refinement_results | FILE |
maximum_lca_threshold | INT |
Output:
informed_reads | FILE |
Contains taxids that have been deemed informative based on a least common ancestor approach |
Description:
Identifies informative read fragments using a lowest common ancestor approach on refinement_results. The tree is traversed upwards at most maximum_lca_threshold times from each taxid, and if the ancestor is shared then that read is deemed informative
Input:
informed_reads | FILE |
Output:
summary_stats | FILE |
Contains the total number of hits, hits by informative read fragments, and exclusive hits by informative read fragments for each sample |
Description:
Unions the informative taxids and isolates the subtrees of the taxid tree resulting from rooting at the members of that union
Input:
summary_stats | FILE |
Output:
summary_table | FILE |
Matrix identifying the number of times each TaxID was hit by an informative read fragment and the number of times it was hit exclusively by an informative read fragment |
Description:
Puts summary_stats into a searchable table format that is more legible for the user.
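Assuming summary_stats can be read into a mapping from TaxID to (informative hits, exclusive informative hits), the table could be written as a simple tab-separated matrix; the on-disk format shown here is an assumption.

```python
import csv

def write_summary_table(summary_stats, out_path):
    """Write one row per TaxID with its informative-hit and exclusive-hit counts."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["taxid", "informative_hits", "exclusive_informative_hits"])
        for taxid, (hits, exclusive_hits) in sorted(summary_stats.items()):
            writer.writerow([taxid, hits, exclusive_hits])
```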
summary_table | FILE |
Describes the number of times each identified TaxID was hit by an informative read fragment and the number of times it was hit exclusively by an informative read fragment |
summary_stats | FILE |
Containing the total number of hits, hits by informative read fragments, and exclusive hits by informative read fragments for each sample |
gi_reference_set | FILE |
File containing the GIs identified by the aggregation module |
trimmed_reads | FILE |
Contains the trimmed reads and their frequencies per sample. |
Input:
gi_reference_set | FILE |
trimmed_reads | FILE |
Output:
coverage_data | FILE |
Description:
Creates a file detailing the amount of coverage that exists for the references.
coverage_data | FILE |
Contains the amount of coverage for each GI sequence. |
summary_stats | FILE |
File containing the total number of hits, hits by informative read fragments, and exclusive hits by informative read fragments for each sample. |
summary_table | FILE |
File which describes the number of times each identified TaxID was hit by an informative read fragment and the number of times it was hit exclusively by an informative read fragment. |
coverage_data | FILE |
File that describes the amount of coverage that is present for the references. |
output_file_path | STRING |
File path where the outputted summary file is to be located. |
order_method | STRING |
Parameter indicating how the list of GIs in the summary file is to be ordered. |
total_results | INT |
The number of results the user wishes to have displayed in the summary file. |
Input:
summary_stats | FILE |
summary_table | FILE |
coverage_data | FILE |
output_file_path | STRING |
order_method | STRING |
total_results | INT |
Output:
summary_file | FILE |
Description:
Writes a file that lists the top X identified GIs, ordered by a specified parameter, as well as the coverage data for each GI.
summary_file | FILE |
File containing a list of TaxIDs that were hit, the GIs associated with those TaxIDs, as well as coverage plots graphically representing the coverage data. |
The modules themselves will be implemented independently and will rely wholly on the State System to prompt them to activate. In this way the individual modules can be completely tested before integration via the State System. Since the modules known as Survey/Refinement and Filter/Aggregation are effectively the same modules in terms of functionality and differ only in the inputs given, each pair will be implemented as one.
The State System itself will be developed independently of any module. This relies on the capacity to register a module with the system so that it can identify the component necessary for each step of the workflow. The outcome is that the State System will be a process-independent product: it could potentially be used for any pipelined process that can be adapted to the use of modules to produce an output.
Based on the decoupled architecture of YAX, only one individual will implement each module, although code review will entail the entire team evaluating the contribution.
Implementation will begin with team members completing the following modules:
Mike:
Readprep
Survey/Refinement
Andrew:
Filter
Aggregation
Hayden:
Alignment
Summary
Evan will begin with implementation of the State System; when their individual modules are finished, the other team members will assist in the State System implementation.
Implementation will begin on the 15th of February, with each member completing a module every seven days. On the 29th of February Mike, Andrew, and Hayden will have completed their modules and will begin to assist Evan with State System implementation. On the 11th of March the State System will be complete. Testing will be accomplished on individual modules as part of the implementation process to assure their correct operation. Upon completion of the State System, testing will begin to prove that it can successfully organize and operate the disparate pipeline modules.
Gantt chart of implementation plan