Design Document
VERSION 4.2 Final
Team Yaxonomic
Evan Bolyen
Mike Deberg
Andrew Hodel
Hayden Westbrook
Faculty Mentor/Project Sponsor: Viacheslav “Slava” Fofanov, PhD
module_schematic (blueprint for the creation of new modules)
Input: (based on indicated process: init, prep, engage)
Output: (based on indicated process: init, prep, engage)
This design document is intended to familiarize the reader with the tasks that the taxonomic assignment pipeline (hereafter referred to as YAX) will be expected to accomplish. A basic computing background should be sufficient to understand the design. Some minor bioinformatic knowledge may be required, which can be gained from any source discussing read-to-reference genome alignment.
The primary function of YAX will be to manage the lengthy process of identifying genetic sequences collected from the environment. Because this process is computationally expensive, the capacity to recover from an error partway through will be of paramount importance. Recovery eliminates the need to restart the entire process from the beginning in the event of such a failure, so as little time as possible is lost in what will already be a time-consuming process.
This state-aware system will also support any user need to rerun a portion of YAX. Since YAX will have knowledge of the various states it has already created, it should be relatively trivial to append additional pieces of data to those states, for example adding reads to an existing run. This system will be entirely decoupled from the modules of YAX.
The modules of YAX will include readprep, survey/refinement, filter, aggregation, alignment, and summary. Readprep will work with the initial read input from the user. It will evaluate reads based on quality and length, chunk them if necessary, and finally collapse any duplicate reads into sets that will be used throughout the other modules. Readprep will also record some basic information that will be necessary for the useful output of YAX.
Survey and Refinement are actually the same module; the only difference is that Survey works on all references found in the National Center for Biotechnology Information's (NCBI) taxonomy database, while Refinement works on a reduced set of references produced by the Filter module based on hits found in Survey. Filter specifically receives data from Survey in the form of taxonomy identifiers (TaxIDs). These TaxIDs are used to find reference sequences that Refinement will use to build new Bowtie2 index files for alignment against the reduced set of references.
Aggregation is fundamentally the same module as Filter and will be used again after the Refinement module. Aggregation will take the output of Refinement, which is again in the form of TaxIDs. These will be filtered once more based on various user inputs, and finally a set of associated GenBank Identifiers (GIs) is compiled from the resulting set of TaxIDs. This output is the primary difference between Aggregation and Filter: Aggregation outputs a set of GIs while Filter outputs a set of TaxIDs. The TaxIDBranch tool, developed separately by Fofanov Labs, will be responsible for identifying these associated GIs and building a FASTA file which will be used in Alignment.
Alignment will utilize the tool Bowtie2 to align the initially provided reads to the GIs found in Aggregation. The intention is that the Bowtie2 call will allow the user to utilize the tool's full range of functionality and will produce information used to summarize the coverage of reads on references. Finally, Summary will collect this data into a single-document report.
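To make the intended Bowtie2 usage concrete, the following is a minimal sketch of how the Alignment module might invoke Bowtie2 as a subprocess; the function name, file names, and the pass-through of extra arguments are illustrative assumptions, not the final interface.

```python
import subprocess

def run_bowtie2(index_prefix, reads_fasta, sam_out, threads=4, extra_args=None):
    """Align FASTA-formatted reads against a pre-built Bowtie2 index.

    extra_args lets the caller pass any additional Bowtie2 options through,
    preserving access to the tool's full range of functionality.
    """
    cmd = ["bowtie2",
           "-x", index_prefix,       # prefix of the pre-built .bt2 index files
           "-f", "-U", reads_fasta,  # unpaired reads in FASTA format
           "-S", sam_out,            # SAM output, later summarized for coverage
           "--threads", str(threads)]
    if extra_args:
        cmd.extend(extra_args)       # e.g. user-supplied alignment settings
    subprocess.run(cmd, check=True)
```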
This modular design is intended to extend the useful life of YAX as much as possible by allowing modules to be replaced as new technologies are developed, with as little rework as possible. It will also aid in the development of YAX itself by allowing an iterative approach to implementing the entire system.
YAX is divided up into a few important components:
IndianaJones - Responsible for handling artifacts and passing these artifacts between the different modules of the pipeline. It creates an ExeGraph that outlines the order in which the modules should be run. It traces this graph and verifies the validity of parameters and inputs for each module. It also records all artifacts created by the pipeline in a database.
Database - Holds references to all artifacts created by YAX. Artifacts have unique ids that point to their file location. These ids are stored in the database along with the configuration parameters that the artifact depends upon.
ExeGraph - A graph representing the modules and the order in which they must be run. IndianaJones will backtrack this graph and for each ExeNode will determine if that node has an artifact that has been previously completed. When IndianaJones finds an already completed artifact, or it reaches the root of the graph, IndianaJones will begin passing artifacts to the various ExeNodes in the proper order.
ExeNode - A node in the ExeGraph. Each ExeNode acts as an interface between its module and the ExeGraph. Responsible for passing artifacts to its module. This is the only part of the State System that has any direct interaction with the modules.
Module - Each different module in the pipeline, including ReadPrep, Survey, Filter, etc., is an instance of a Module. These modules must all take in artifacts and be responsible for knowing what kinds of artifacts they need. The modules will inform the rest of the system if the artifacts given to them are valid.
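The following skeleton illustrates how these components could relate to one another; class fields and method names beyond what is described above are assumptions made for the sake of the sketch (the Module interface itself is sketched later, near the module definitions).

```python
class ExeNode:
    """Interface between one Module and the ExeGraph; the only part of the
    State System that interacts with modules directly."""
    def __init__(self, module):
        self.module = module
        self.output_artifact = None      # declared during prep, filled during engage

    def execute(self, input_artifacts):
        return self.module.run(input_artifacts, self.output_artifact)


class ExeGraph:
    """The modules and the order in which they must be run."""
    def __init__(self, ordered_nodes):
        self.nodes = ordered_nodes       # ExeNodes from root to final output


class IndianaJones:
    """Builds the ExeGraph from the arch_config, traverses it, and records
    every artifact the pipeline creates in the database."""
    def __init__(self, exe_graph, database):
        self.graph = exe_graph
        self.database = database         # artifact ids plus the parameters they depend on
```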
Users of YAX will control the pipeline through three primary commands and a configuration file. These commands are init, prep, and engage. Each of these commands will serve a different purpose in setting up and executing the architecture of YAX. Init generates the configuration file the user populates with parameters. Prep verifies that the provided parameters will indeed successfully run the pipeline in its current configuration. Engage actually runs the pipeline on the provided data and parameters.
The YAX pipeline will be described by the architectural configuration file (arch_config). IndianaJones will build an ExeGraph and validate/construct a database based on this file. The arch_config is validated against the database every time the pipeline is run. In this way the architectural configuration file is solely responsible for informing YAX of how the pipeline is configured.
Init
The Init step uses the arch_config to generate an ExeGraph which is then traversed to identify each module and their relative order. The annotations of each module are then used to create a template Config file and to generate a schema for the Database. The Database will be created in this step.
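As a rough illustration, generating the Config template might look like the sketch below; the per-module section layout and the `name`/`required_parameters` attributes are assumptions about how module annotations are exposed, not a committed format.

```python
def generate_config_template(exe_graph, config_path):
    """Walk the ExeGraph in order and write a template Config with one
    section per module, listing each required parameter with a blank value
    for the user to fill in."""
    lines = []
    for node in exe_graph.nodes:
        lines.append("[%s]" % node.module.name)
        for param in node.module.required_parameters():
            lines.append("%s = " % param)   # left blank for the user
        lines.append("")
    with open(config_path, "w") as handle:
        handle.write("\n".join(lines))
```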
Prep
Once again the arch_config is used to generate the ExeGraph, but this time the parameters of Config are validated against the corresponding Module parameters. Each parameter is responsible for defining its own validation logic, so asserting that files exist and that inputs are bounded appropriately will happen as a part of this process. Once completed, the ExeGraph is again traversed and Artifacts are declared in the ArtifactMap representing the future output of each ExeNode execution.
Engage
Same as Prep, but now the defined Artifacts in the ArtifactMap are passed by each ExeNode into its Module, as described by Traverse (below), until the final output of the ExeGraph has been achieved.
To evaluate the ExeGraph, the last ExeNode is checked for a completed output; if one is found, no further work is required. Otherwise the second-to-last ExeNode is checked, and so on, until the root is reached (and dependencies are available) or a completed output is found. The following ExeNode then passes the completed Artifact (a dependency of the current module) from the previous ExeNode, along with its own incomplete output Artifact, into its Module. The Module will write results onto the output Artifact until it is completed. If a failure occurs, the Artifact remains incomplete. This means the above process can be repeated and the same incomplete Artifact will be passed to the Module, which may then choose to start over or reuse any data already present in the incomplete Artifact.
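A minimal sketch of this evaluation, assuming a linear chain of ExeNodes and hypothetical `output_artifact`, `is_complete()`, and `dependencies()` members:

```python
def traverse(exe_graph):
    """Backtrack from the final ExeNode to the most recent completed output,
    then run forward from that point, handing each Module its completed
    upstream Artifacts and its own (possibly partial) output Artifact."""
    nodes = exe_graph.nodes                    # ordered from root to final node
    start = 0
    for i in range(len(nodes) - 1, -1, -1):    # check last node first, then backtrack
        if nodes[i].output_artifact.is_complete():
            start = i + 1                      # everything up to node i is already done
            break
    for node in nodes[start:]:
        upstream = node.dependencies()         # completed Artifacts from earlier nodes
        node.module.run(upstream, node.output_artifact)
        # If the run fails, output_artifact stays incomplete, so repeating
        # traverse() hands the same partial Artifact back to the Module,
        # which may start over or reuse the data already present.
```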
Each module is defined in terms of inputs, outputs, and a description of the functionality it provides. In these descriptions, parameters that may be shared among modules are not identified as such and will appear as parameters in both modules. Some modules fulfil multiple roles in the current version of the pipeline; this does not preclude the possibility of replacing the module in one instance and leaving it in another in the future. To be considered a module, it must be able to take an artifact as an input, explore the artifact to find the parameters required for running its functions, and output an artifact that it has populated with the correct output. Modules will have a defined interface to communicate with their ExeNode, the bridge between the actual storage of the artifact and the querying of it.
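A possible shape for that interface, under the assumption that it is expressed as a Python base class (method names are illustrative):

```python
class Module:
    """Contract a pipeline stage must satisfy to plug into the State System."""

    def required_parameters(self):
        """Parameter names (and their validation rules) this module needs;
        used by init to build the Config template and by prep to validate."""
        raise NotImplementedError

    def validate(self, artifacts, parameters):
        """Report whether the supplied artifacts and parameters are usable,
        so the rest of the system never needs module-specific knowledge."""
        raise NotImplementedError

    def run(self, input_artifacts, output_artifact):
        """Explore the input artifacts for the data it needs and populate
        output_artifact, marking it complete when finished."""
        raise NotImplementedError
```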
Input:
artifact | TYPE |
An item on which the module will work to achieve its assigned task |
parameters | VARIED |
A number of required input parameters that will be identified and added to the configuration file during the initialization phase of the State System |
Input:
parameters | VARIED |
A number of required input parameters that will be used to initialize the empty artifact |
artifact | TYPE |
An uninitialized artifact handle |
Output:
artifact | TYPE |
An initialized artifact that has a concept of completeness: its status and which of its components are incomplete |
arch_config (init) | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
config_file (prep, engage) | FILE |
Contains user input parameters for the various modules to be run |
config_file (init) | FILE |
Produces a config file containing the necessary parameters to accomplish each part of the pipeline; the user is prompted to fill in the fields with appropriate information |
database (init) if new arch_config | DB |
If an arch_config differing from the previous run is identified, a new database is constructed to represent the pipeline described in the new arch_config |
Input:
arch_config | FILE |
Description:
Receives an arch_config file which describes the current configuration of the modules. Based on this arch_config file, __init__ identifies the necessary components (db, artifact_map) that will be required for completion of the pipeline
Input:
arch_config | FILE |
Description:
Builds the ExeGraph, which is an object that has knowledge of the current pipeline and access to the various modules through their ExeNode handlers.
Input:
config_location | FILEPATH |
Description:
Based on the arch_config, modules are polled by their ExeNode handlers to retrieve their names and input parameters. Based on this information a config file is produced where users can populate the input fields. At this time an ExeGraph is created to inform the system of the pipeline’s configuration.
Input:
config | FILE |
Containing the user input parameters for each module in the pipeline |
Output:
Artifacts | DB ENTRY |
Empty artifact object registered to the artifact_map |
Description:
prep() takes in the user completed config file and verifies via ExeNode/Module communication that each parameter will indeed work during execution of the module by walking through the ExeGraph. This could require several iterations if the user has entered an invalid input. Upon successful completion of prep() artifacts are allocated.
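A sketch of that walk, assuming hypothetical `section()`, `validate_parameters()`, and `declare()` helpers on the Config, ExeNode, and ArtifactMap respectively:

```python
def prep(exe_graph, config, artifact_map):
    """Validate the user-filled Config against every Module via its ExeNode,
    then declare the future output Artifacts in the ArtifactMap."""
    errors = []
    for node in exe_graph.nodes:
        params = config.section(node.module.name)    # user-supplied values for this module
        valid, error_list = node.validate_parameters(params)
        if not valid:
            errors.extend(error_list)
    if errors:
        return errors          # the user corrects the Config and reruns prep
    for node in exe_graph.nodes:
        artifact_map.declare(node)   # empty Artifact representing the node's future output
    return []
```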
Input:
config | FILE |
Containing the user input parameters for each module in the pipeline |
Description:
After initialization and preparation have completed successfully, engage will set the various modules of the pipeline to accomplish their tasks.
Input:
exe_graph | ExeGraph |
A specific ExeGraph that will be used to create the ArtifactMap |
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
artifact_map | UUID |
The location where the first artifact associated with the input will be put |
Description:
Given an exe_graph, looks up the arch_config in order to build a map of where the artifacts required by the configuration will be placed
Input:
datapath | FILEPATH |
The path to the arch_config in the database and its resulting artifacts |
Output:
correct_path | BOOLEAN |
Tells whether that path exists and points to a step in the pipeline |
Description:
Responsible for assuring that the current version of the pipeline matches the database being used.
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline |
Output:
artifact_db | FILE |
A database of uninstantiated artifacts is created ready for use |
Description:
Creates the ArtifactMap structure and its locations in memory, but nothing is populated or actually instantiated
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
artifact_name | STRING |
The location in the arch_config where the added artifact should be placed |
Output:
artifact_uuid | UUID |
The universal unique identifier for the artifact in ArtifactMap |
Description:
Adds an artifact into ArtifactMap using arch_config to place it in the correct position
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
artifact_name | FILEPATH |
The location in the arch_config of the artifact at which the deletion should start |
Output:
artifact_uuid | UUID |
The universal unique identifier for the artifact in ArtifactMap that is located upstream of the artifact that was deleted |
Description:
Will delete the artifact that is passed into the function as well as all downstream dependencies in the ArtifactMap.
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
artifact_uuid | UUID |
The UUID of the artifact that IndianaJones wants to update |
Output:
update_success | BOOLEAN |
A boolean indicating that the update was completed successfully |
Description:
IndianaJones will use this function when a module has completed and passed a completed artifact through its ExeNode, or when a suspended module has resumed and completed an artifact that had already been started.
Input:
artifact_uuid | UUID |
The universal unique identifier for the artifact |
stream_direction | STRING |
Whether to move UP or DOWN stream |
Output:
artifact_UUID | UUID |
The universal unique identifier for the artifact in the ArtifactMap that you are searching for |
Description:
Will be used to move up or downstream to an artifact
Input:
arch_config | FILE |
The ExeGraph you want to delete is uniquely identified by the arch_config that built it |
Output:
successful_delete | BOOLEAN |
Tells whether or not the ExeGraph was deleted |
Description:
Deletes an ExeGraph based on an arch_config that will be used to find the specific graph to delete
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
exe_graph | UUID |
The universal unique identifier for the created ExeGraph
Description:
Creates the ExeGraph in memory, along with all required ExeNodes, based on the arch_config
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
remove_success | BOOLEAN |
Tells whether or not the deletion of the ExeGraph was a success |
Description:
Deletes the ExeGraph associated with the current arch_config
module | TYPE |
An ExeNode must be associated with a module. |
Input:
config_file | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
parameters_valid | BOOLEAN |
If true then error_list will be empty, if false then error_list will be populated with at least one entry |
error_list | FILE |
Contains a list of all the parameters that were given that are not acceptable in their current format if they were to be given to the proper module. |
Description:
Will probe each module, giving it the parameters that the parameter list holds for that module; the module will then return whether each is an acceptable input according to its type and range.
Input:
arch_config | FILE |
A system configuration file containing the current configuration of modules to be used in the pipeline. |
Output:
config_file | FILE |
A list of all the parameters that are required for this specific architecture |
Description:
This will supply a file that the user reviews and fills in with the required parameter values.
Input:
artifact | TYPE |
The input artifact required by the module |
Output:
artifact | TYPE |
The output artifact from the module |
Description:
This will actually run the module with its proper input and will receive back an output artifact that will be stored in ArtifactMap
Input:
artifact or exe_node | TYPE or ExeNode |
This can be supplied with an artifact which will then be looked up by its configuration file to find the ExeNode that represents it, or through an ExeNode location which will then resolve its dependencies |
direction | STRING |
Tells the function which direction of dependencies the user wants returned, DOWN or UP stream |
Output:
dependent_nodes | ARRAY |
A list of ExeNodes that are dependent on the input artifact or ExeNode |
Description:
Will explore the database in order to find the list of nodes that are dependent on the input
Input:
artifact | TYPE |
Output:
complete | BOOLEAN |
Tells the completeness of the artifact |
Description:
Called by module to decide if more work is needed to complete the artifact and keep the pipeline running
infile_list | STRING |
Contains a list of FASTQ formatted files which store the reads to be prepared |
work_dir | STRING |
Location of the working directory |
out_dir | STRING |
Location of the output directory |
num_workers | INT |
Number of processes that will work on preparing read files |
max_mem_target_gb | INT |
Memory limit to be used to avoid using swap |
trim_type | STRING |
Indicates the type of trimming to be accomplished (LCD, LCDQ, SEQ) |
segment_length (optional) | INT |
Indicates the length to trim reads to in some trim type schemes |
adapter_list (optional) | STRING |
Location of file containing adapter sequences to strip from reads |
adapter_tolerance (optional) | INT |
Number of bases in each read to check for the presence of adapter sequences |
minimum_quality | STRING |
Minimum quality threshold of a single base in the read |
min_qual_tolerance | INT |
Number of times a base in the read is allowed to miss the minimum_quality before the entire sequence is discarded |
trimmed_reads | FILE |
Containing the trimmed reads and the number of their occurrences in the provided reads |
Takes read file input in the form of FASTQ formatted files and processes those reads in the indicated fashion to create sequences that will be used as reads throughout the alignment steps.
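The duplicate-collapsing step, for instance, could be as simple as the sketch below; quality/length trimming and chunking are omitted, and the counting scheme shown is an assumption about how occurrence counts are kept.

```python
from collections import Counter

def collapse_duplicates(fastq_paths):
    """Collapse identical (already trimmed) reads into unique sequences,
    keeping the number of times each sequence occurred across the input files."""
    counts = Counter()
    for path in fastq_paths:
        with open(path) as handle:
            for line_number, line in enumerate(handle):
                if line_number % 4 == 1:        # the sequence line of each FASTQ record
                    counts[line.strip()] += 1
    return counts                               # {sequence: occurrence count}
```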
trimmed_reads | FILE |
Containing the trimmed reads and the number of their occurrences in the provided reads |
index_location | STRING |
Location of pre-computed index files |
taxid_manifest | FILE |
Containing a list of taxids to be used in alignment |
survey_results | FILE |
Containing a list of read fragments associated with the taxids of interest and sorted based on the number of mismatches per association |
Input:
taxid_manifest | FILE |
Containing a list of taxids to be used in alignment |
Description:
Taxids are retrieved from the taxid_manifest and prepared for use in the alignment process
Input:
trimmed_reads | FILE |
Containing the trimmed reads and the number of their occurrences in the provided read files |
Output:
reads | FILE |
FASTA formatted file containing entries for the reads found in the trimmed_reads |
Description:
Reads are retrieved from the trimmed_reads file and put into FASTA format, which can be easily input to various alignment tools
Input:
taxid_of_interest | STRING |
The taxid of the index to run an alignment on |
reads_path | STRING |
Path to the FASTA formatted file containing reads to align |
Output:
taxid_reads_alignment | FILE |
File produced by the alignment containing the alignment information of the reads against the taxid's index |
Description:
Accomplishes the actual alignment of read files to a single taxid index producing alignment information in file form
Input:
trimmed_reads | FILE |
index_location | STRING |
taxid_manifest | FILE |
Output:
survey_results | FILE |
Description:
For each index file a process is spawned to run the alignment of the reads against the index file or group of taxids indicated.
NOTE
Index files must be pre-computed and named after their reference taxid. This module has the potential to use multiple processes, so that each read file can be aligned to an index in its own process, potentially allowing the alignment of multiple read files to different indexes simultaneously.
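A minimal sketch of that process-per-index scheme, assuming the indexes are named by TaxID and that Bowtie2 is invoked directly; the helper names are illustrative.

```python
import os
import subprocess
from multiprocessing import Pool

def align_one(index_prefix, reads_fasta, sam_out):
    """Align the prepared reads against a single pre-computed Bowtie2 index."""
    subprocess.run(["bowtie2", "-x", index_prefix, "-f", "-U", reads_fasta,
                    "-S", sam_out], check=True)

def survey(index_prefixes, reads_fasta, out_dir, num_workers):
    """Spawn one alignment per index, up to num_workers at a time; since each
    index is named after its reference TaxID, the SAM output is named to match."""
    jobs = [(prefix, reads_fasta,
             os.path.join(out_dir, os.path.basename(prefix) + ".sam"))
            for prefix in index_prefixes]
    with Pool(num_workers) as pool:
        pool.starmap(align_one, jobs)
```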
survey_results | FILE |
Containing a list of read fragments associated with taxids and sorted in tree form based on the number of mismatches per association. |
maximum_lca_threshold | INT |
Provided by the user in the configuration file; used to determine a threshold of informativeness |
taxid_tree | FILE |
The entire taxid tree for looking up informativeness |
Input:
survey_results | FILE |
maximum_lca_threshold | INT |
Output:
informed_reads | FILE |
Contains taxids that have been deemed informative based on a least common ancestor approach |
Description:
Identifies informative read fragments using a lowest common ancestor approach on survey_results. The tree is traversed upwards at most maximum_lca_threshold times from each taxid, and if the ancestor is shared then that read is deemed informative
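A sketch of that rule for a single read, assuming the taxid tree is available as a child-to-parent mapping (the data structure is an assumption; the threshold semantics follow the description above):

```python
def is_informative(hit_taxids, parent, max_lca_threshold):
    """Return True if every TaxID this read hit reaches a shared ancestor
    within at most max_lca_threshold steps up the taxid tree."""
    if not hit_taxids:
        return False
    ancestor_sets = []
    for taxid in hit_taxids:
        seen = {taxid}
        node = taxid
        for _ in range(max_lca_threshold):
            node = parent.get(node)
            if node is None:                    # walked past the root
                break
            seen.add(node)
        ancestor_sets.append(seen)
    shared = set.intersection(*ancestor_sets)   # ancestors common to every hit
    return len(shared) > 0
```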
Input:
informed_reads | FILE |
Output:
filter_results | FILE |
FASTA format with reference sequences associated with the taxids from the union of the members of each subtree of informed_reads |
Description:
Unions the informative taxids and isolates the subtrees of the taxid tree resulting from rooting at the members of that union.
filter_results | FILE |
FASTA format of taxids that have been deemed informative |
refinement_results | FILE |
Containing a list of read fragments associated with taxids and sorted in tree form based on the number of mismatches per association |
maximum_lca_threshold | INT |
Provided by the user in the configuration file; used to determine a threshold of informativeness |
taxid_tree | FILE |
The entire taxid tree for looking up informativeness |
Input:
refinement_results | FILE |
maximum_lca_threshold | INT |
Output:
informed_reads | FILE |
Contains taxids that have been deemed informative based on a least common ancestor approach |
Description:
Identifies informative read fragments using a lowest common ancestor approach on refinement_results. The tree is traversed upwards at most maximum_lca_threshold times from each taxid, and if the ancestor is shared then that read is deemed informative
Input:
informed_reads | FILE |
Output:
summary_stats | FILE |
Contains the total number of hits, hits by informative read fragments, and exclusive hits by informative read fragments for each sample |
Description:
Unions the informative taxids and isolates the subtrees of the taxid tree resulting from rooting at the members of that union
Input:
summary_stats | FILE |
Output:
summary_table | FILE |
Matrix identifying the number of times each TaxID was hit by an informative read fragment and the number of times it was hit exclusively by an informative read fragment |
Description:
Puts summary_stats into a searchable table format that is more legible for the user.
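Assuming summary_stats can be read into a mapping from TaxID to (informative hits, exclusive informative hits), the table could be written as a simple tab-separated matrix; the on-disk format shown here is an assumption.

```python
import csv

def write_summary_table(summary_stats, out_path):
    """Write one row per TaxID with its informative-hit and exclusive-hit counts."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["taxid", "informative_hits", "exclusive_informative_hits"])
        for taxid, (hits, exclusive_hits) in sorted(summary_stats.items()):
            writer.writerow([taxid, hits, exclusive_hits])
```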
summary_table | FILE |
Describes the number of times each identified TaxID was hit by an informative read fragment and the number of times it was hit exclusively by an informative read fragment |
summary_stats | FILE |
Containing the total number of hits, hits by informative read fragments, and exclusive hits by informative read fragments for each sample |
gi_reference_set | FILE |
File containing the GIs identified by the aggregation module |
trimmed_reads | FILE |
Contains the trimmed reads and their frequencies per sample. |
Input:
gi_reference_set | FILE |
trimmed_reads | FILE |
Output:
coverage_data | FILE |
Description:
Creates a file detailing the amount of coverage that exists for the references.
coverage_data | FILE |
Contains the amount of coverage for each GI sequence. |
summary_stats | FILE |
File containing the total number of hits, hits by informative read fragments, and exclusive hits by informative read fragments for each sample. |
summary_table | FILE |
File which describes the number of times each identified TaxID was hit by an informative read fragment and the number of times it was hit exclusively by an informative read fragment. |
coverage_data | FILE |
File that describes the amount of coverage that is present for the references. |
output_file_path | STRING |
File path where the outputted summary file is to be located. |
order_method | STRING |
Parameter indicating how the list of GIs in the summary file is to be ordered. |
total_results | INT |
The number of results the user wishes to have displayed in the summary file. |
Input:
summary_stats | FILE |
summary_table | FILE |
coverage_data | FILE |
output_file_path | STRING |
order_method | STRING |
total_results | INT |
Output:
summary_file | FILE |
Description:
Writes a file that lists the top X identified GIs, ordered by a specified parameter, as well as the coverage data for each GI.
summary_file | FILE |
File containing a list of TaxIDs that were hit, the GIs associated with those TaxIDs, as well as coverage plots graphically representing the coverage data. |
The modules themselves will be implemented independently and will rely wholly on the State System to prompt them to activate. In this way the individual modules can be completely tested before integration via the State System. Since the modules known as Survey/Refinement and Filter/Aggregation are effectively the same modules in terms of functionality and differ only in the inputs given, each pair will be implemented as one.
The State System itself will be developed independently of any module. This relies on the capacity to register a module with the system so that it can identify the component necessary for each step of the workflow. The outcome is that the State System will be a process-independent product: it could potentially be used for any pipelined process that can be adapted to the use of modules to produce an output.
Based on the decoupled architecture of YAX, only one individual will implement each module, although code review will entail the entire team evaluating the contribution.
Implementation will begin with team members completing the following modules:
Mike:
Readprep
Survey/Refinement
Andrew:
Filter
Aggregation
Hayden:
Alignment
Summary
Evan will begin with implementation of the State System; when their individual modules are finished, the other team members will assist in the State System implementation.
Implementation will begin on the 15th of February, with each member completing a module every seven days. On the 29th of February Mike, Andrew, and Hayden will have completed their modules and will begin to assist Evan with State System implementation. On the 11th of March the State System will be complete. Testing will be accomplished on individual modules as part of the implementation process to assure their correct operation. Upon completion of the State System, testing will begin to prove that it can successfully organize and operate the disparate pipeline modules.
Gantt chart of implementation plan