Requirements

VERSION 3.0

Team Yaxonomic

Evan Bolyen

Mike Deberg

Andrew Hodel

Hayden Westbrook

Faculty Mentor/Project Sponsor: Viacheslav “Slava” Fofanov, PhD

0. Table of Contents

0. Table of Contents

1. Glossary

2. Introduction

3. Problem and Solution Statement

4. Functional Requirements

Global Requirements

Read Preparation (ReadPrep)

Survey

Filter

Refinement

Aggregation

Alignment

Summary

5. Non-functional Requirements

Global Requirements

6. Potential Risks

7. Project Plan

 


1. Glossary

Artifact:

        A collection of output data from a single step of the pipeline.

Artifact Key:

        An opaque handle to an artifact. How YAX tracks artifacts.

Bowtie2:

        Tool for the alignment of genetic reads to reference sequences.

Coverage:

        A measurement of how much of a reference sequence is covered by reads.

FASTA file:

        File format used for storing genetic sequences, with a header line followed by sequence data.

GenBank:

        Repository of genetic sequences.

Genetic sequence:

A string of characters representing nucleotides that is the output of a sequencing machine. The characters used to represent nucleotides are A, T, C, and G; in some cases N and other characters are used as placeholders for unknown nucleotides. This is specifically the IUPAC alphabet.

GI:

A unique numeric identifier assigned to every genetic sequence observed by NCBI. Every genetic sequence observed will have a unique GI, even if they are from the same species or even the same individual.

High-throughput Sequencing:

        A term describing modern sequencing techniques

Hit:

        When a read is mapped to a TaxID

In List:

        A file listing, per sample, the absolute paths to the read set files YAX will process.

Metagenome:

The presumed set of all DNA in a discrete environment (the functional capacity of that environment).

NCBI:

        National Center for Biotechnology Information

Nucleotide:

        A single nucleic acid unit of which DNA is composed.

Pathogen:

        A harmful biological agent.

Read:

A genetic sequence from an unknown species that comes from a sample from the environment.

Read Fragment:

        A read that has been processed by ReadPrep

Reference:

A genetic sequence contained within the NCBI database that belongs to a known species.

RunID:

Identification of a single complete branch of a YAX run. Used to manage branch data; represents an amalgamation of Artifact Keys in the pipeline.

Sample:

        Genetic material collected from a discrete environment, from which reads are sequenced.

Taxon:

        A particular node in the taxonomy tree.        

Taxonomic Assignment:

        The pairing of a sequence to a taxonomy

TaxID:

A unique numeric identifier for a node in the taxonomic tree of NCBI’s database.

TaxID Tree:

File containing a set of TaxIDs and metadata that can be used to build a tree representation of taxonomic data

TaxID Reference Set:

        FASTA file that contains the set of sequences for TaxID references

TaxIDBranch:

        Tool for isolating branches of taxonomy.

Wall-time:

        The amount of real (clock) time for which a process executed.

YAX:

A working name for the taxonomy assignment pipeline. We do not have an official name yet, but we recommend YAX (YAX assigns taxonomy).


2. Introduction

This requirements document is intended to familiarize the reader with the tasks that the taxonomic assignment pipeline (hereafter referred to as YAX) will be expected to accomplish. A basic computing background should be sufficient to understand the requirements. Some minor bioinformatics knowledge may be required, which can be gained from any source discussing read-to-reference genome alignment.

The primary function of YAX will be to manage the lengthy process of identifying genetic sequences collected from the environment. It is expected that the computational timeframe of this process will be fairly lengthy and that the capacity to recover from an error at some point in the process will be of paramount importance. The intention of this is to eliminate the need to restart the entire process from the beginning in the event of such a failure. In this way as little time as possible will be lost in what will already be a time consuming process.

This state-aware system will also support any user need to rerun a portion of YAX. Since YAX will have knowledge of the various states it has already created, it should be relatively trivial to append additional pieces of data to those states, for example adding reads to an existing run. This system will be entirely decoupled from the modules of YAX.
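
A minimal sketch of how such a decoupled state system might track artifacts and branches follows. All class and method names here are illustrative assumptions, not a prescribed design:

```python
import hashlib
import json

class StateStore:
    """Illustrative artifact tracker: maps opaque artifact keys to
    completion status and remembers which keys belong to each run."""

    def __init__(self):
        self.artifacts = {}   # artifact key -> {"complete": bool, "data": ...}
        self.runs = {}        # RunID -> ordered list of artifact keys

    def record(self, run_id, step_name, data, complete=True):
        # Derive an opaque key from the step name and its output data.
        key = hashlib.sha1(json.dumps([step_name, data]).encode()).hexdigest()
        self.artifacts[key] = {"complete": complete, "data": data}
        self.runs.setdefault(run_id, []).append(key)
        return key

    def branch(self, run_id, new_run_id, from_key):
        # Start a new branch that shares history up to and including from_key.
        history = self.runs[run_id]
        self.runs[new_run_id] = history[: history.index(from_key) + 1]

    def restart_point(self, run_id):
        # The last completed artifact is where a failed run can resume.
        done = [k for k in self.runs[run_id] if self.artifacts[k]["complete"]]
        return done[-1] if done else None
```

Appending reads to an existing run would then amount to branching from the last ReadPrep artifact and recording new artifacts on the branch.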

The modules of YAX will include ReadPrep, Survey/Refinement, Filter, Aggregation, Alignment, and Summary. ReadPrep will work with the initial read input from the user. It will evaluate reads based on quality and length, chunk them if necessary, and finally collapse any duplicate reads into sets that will be used throughout the other modules. ReadPrep will record some basic information that will be necessary for the useful output of YAX.

Survey and Refinement are actually the same module, the only difference being that Survey will work on all references found in the National Center for Biotechnology Information’s (NCBI) taxonomy database while Refinement will work on a reduced set of references produced by the Filter module based on hits found in Survey. Filter specifically receives data from Survey in the form of taxonomy identifiers (TaxIDs). These TaxIDs are used to find reference sequences that Refinement will use to build new Bowtie2 index files used for alignment to a reduced set of references.

Aggregation is fundamentally the same module as Filter and will be used again after the Refinement module. Aggregation will take the output of Refinement, which is again in the form of TaxIDs. These will be filtered once more based on various user inputs, and finally a set of associated GenBank Identifiers (GIs) is compiled from the decided set of TaxIDs. This output is the primary difference between Aggregation and Filter: Aggregation outputs a set of GIs while Filter outputs a set of TaxIDs. The TaxIDBranch tool, developed separately by Fofanov Labs, will be responsible for identifying these associated GIs and building a FASTA file which will be used in Alignment.

Alignment will utilize the tool Bowtie2 to align the initially provided reads to the GIs found in Aggregation. The intention is that the Bowtie2 call will allow the user to utilize its full range of functionality and will produce information used to summarize the coverage of reads on references. Finally, Summary will condense this data into a single-document report.

This modular design is intended to extend the significance of YAX as much as possible by allowing the modules to be replaced as new technologies are developed with as little rework as possible. It will also aid in the development of YAX itself to allow an iterative approach to the implementation of the entire system.


3. Problem and Solution Statement

The project’s sponsor, Dr. Viacheslav Fofanov, is an assistant professor in the informatics and computing program at NAU. His research focuses on detecting pathogens in environmental samples by utilizing high-throughput sequencing data. This research is important because it provides a line of defense against the spread of dangerous pathogens.

To accomplish this, rapid assessment of the samples is necessary. Using traditional selective or differential media to culture organisms can be especially time-consuming. Worse, most pathogenic organisms cannot be cultured in a lab at all.

A relatively modern alternative is high-throughput sequencing, which offers the ability to quickly collect large volumes of data about the DNA present in a sample. There are two general approaches: amplicon and shotgun sequencing. Amplicon sequencing chemically targets known areas of a genome, while shotgun sequencing pulls out random fragments. Because the goal is to collect community information spanning bacteria, eukaryotes, and viruses, targeting any one area of a genome will not work well, as these groups are too different. That leaves shotgun sequencing, which will produce information about the entire metagenome, but at the cost of not knowing where any particular sequence came from in a genome, or even the genome it originated from. Furthermore, these reads are generally short, spanning a couple hundred nucleotides.

NCBI maintains databases that contain information on organisms, each of which has an assigned TaxID and a list of GIs. Most importantly for our purposes, it maintains associations linking the two, so given an organism one can retrieve a TaxID and all the GIs associated with it. However, these databases are massive, and it is unlikely that a given read will ever match a reference exactly. This means efficient fuzzy fragment matching against these massive datasets is needed. Additionally, as all life (with the potential exception of viruses) has a common origin, many components of a genome are shared. This complicates the search, as the detection of a match is not conclusive evidence for the presence of a given taxon.

What is needed is a system that can: rapidly match metagenomic shotgun reads to sub-sequences of a reference database; identify which reads are informative and which are not; draw conclusions on community membership; and provide some quantitative measure of confidence in those conclusions.

YAX will be a pipeline that should determine what species are represented within a sample of genetic data. The pipeline will use two primary sources of data. The first are the reads that are the product of sequencing environmental samples. The second are the references that come from the NCBI database. YAX will also utilize a configuration file that the user can edit to manipulate different settings. The reads will be mapped to references to determine what species are represented in the sample. The output of the pipeline will be a number of PDF documents that will describe and visualize the results.

The actual read preparation, aligning of genetic sequences, and mapping of reads to references will be taken care of by tools that are already implemented. Because of this, the system to be developed will primarily be responsible for moving data between separate modules of the pipeline. As a result, one of the core functionalities of the pipeline is a state system that will keep track of what step of the pipeline the execution is currently in, and that will allow the user to return to any previous state and branch execution into another run of the pipeline using different data or configuration options.

Finding the identity of an unknown sample sequence is time consuming and requires the management of extremely large data sets. For this reason, it is expected that coarse decisions are made very rapidly based on the total dataset to shrink it into manageable sizes for more comprehensive analysis. In order to achieve the highest level of accuracy and precision while successfully managing the large data sets presented in this problem an iterative filtering process will be used.

This means that the same general process will be repeated multiple times as it reduces and fine tunes the parameters during each iteration.

The run time of YAX is expected to be relatively long, up to a number of weeks. Because of this, error recovery is a major issue to be addressed by the implementation of YAX. This must be done in a fashion that relies on as little recomputation as possible: the various steps and phases of the project must report accurately on their successful completion or error state, and must store data in such a way that the damaged or incomplete section is the only part that needs recomputation.

Figure 1. An outline of the module associations with corresponding artifact inputs and outputs. It should be noted that each module and artifact is a requirement for implementing the taxonomy assignment algorithm.


4. Functional Requirements

  1. Global Requirements

  1. Must rely on a configuration file for the provision of numerous inputs to the various YAX modules.
  1.1. The configuration file will be associated with the project it is used to run.
  1.2. Must provide initialization of the configuration file that produces a file, with some default settings, for the user to fill in with the applicable data.
  1.3. File and directory locations are to be referenced by absolute path in the configuration file.
  1.4. This configuration file will contain the number of CPUs to be allocated to YAX.
  1.5. This configuration file will contain the total memory allocated to YAX.
  1.6. This configuration file will contain the absolute path to the in list of read set files.
  1.7. The configuration file will contain the absolute path to the reference data.
  1.8. The configuration file will contain the absolute path to the pre-computed set of reference sequence data that will be used in Survey.
  1.9. The configuration file will contain the absolute path to a TaxID Reference Data file containing all reference sequences of interest.
  1.10. The configuration file will contain the absolute path to a TaxID Tree file which will be used during the Filter and Aggregation steps for taxonomic tree construction.
  2. YAX will accept read sets in the form of absolute paths per sample identified in the in.list file parameter.
  3. YAX must utilize a state system that will allow the user to return to a previous state and continue operation from that state in a separate branch of the state system.
  3.1. Must be able to restart from any completed artifact.
  3.2. Must track whether an artifact is complete.
  3.3. Must maintain the associations between each run and the artifacts involved in that run.
  3.3.1. The user should be able to remove all artifacts from a given artifact to the end of a run.
  4. Initial user inputs for YAX must be limited to a read set file list and a configuration file.
  5. YAX must verify the existence of any dependencies before beginning any process.
  6. The YAX directory structure must allow for the migration of a project within a file system.
  7. YAX must be able to identify some likely reasons (exceeding memory or runtime allowances) for module failure and, when possible, adjust its non-functional behaviour so as to avoid the same failure condition.
  2. Read Preparation (ReadPrep)

  1. The module must receive a list of samples and the associated sequence data.
  2. The module may optionally accept a list of adapters which, when provided, should be stripped from the beginning of each sequence if present.
  2.1. The user may specify the minimum number of matching bases at the beginning of the read and end of the adapter before the adapter is stripped.
  3. The module must filter out sequences of low call quality. These sequences must not be used in downstream analysis.
  3.1. The user should be able to specify a minimum base call quality that the bases of the sequence should ideally meet.
  3.2. The user should be able to specify a maximum number of times the minimum call quality may be missed before discarding the entire sequence.
  4. The module must convert raw reads into uniform-length sequence fragments. The three supported modes for this must be: LCD, LCDQ, and SEG.
  4.1. LCD (least common denominator) must take the first N contiguous bases of every sequence, where N is the length of the shortest complete sequence.
  4.2. LCDQ (LCD + quality) must take N contiguous bases of every sequence, where N is the length of the shortest complete sequence and the contiguous bases start at positions that maximize their call quality.
  4.3. SEG (segmented) will cut a sequence into contiguous non-overlapping fragments of length N, where N is provided by the user. The remainder fragment is discarded.
  5. The module must collapse identical sequence fragments and annotate the observation frequency of each sequence fragment for every sample.
  6. The module will produce a single file with each computed sequence fragment and the per-sample observation frequencies.
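
The fragment modes and collapsing step above can be sketched as follows. This is a simplified illustration; LCDQ is omitted because it additionally depends on per-base call quality:

```python
from collections import Counter

def lcd_fragments(reads):
    """LCD: take the first N bases of every read, where N is the
    length of the shortest complete read."""
    n = min(len(read) for read in reads)
    return [read[:n] for read in reads]

def seg_fragments(read, n):
    """SEG: cut a read into contiguous, non-overlapping length-n
    fragments; the remainder fragment is discarded."""
    return [read[i:i + n] for i in range(0, len(read) - n + 1, n)]

def collapse(fragments):
    """Collapse identical fragments, keeping observation frequencies."""
    return Counter(fragments)
```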
  3. Survey

  1. This module must receive the output defined in 2.6 as input.
  2. Must receive the pre-computed TaxID Reference Data set.
  3. Must receive a manifest of the TaxIDs to be used.
  4. This module must identify all possible TaxIDs that every collapsed read fragment matches.
  4.1. The user may provide a mismatch threshold which will be used as the maximum number of allowable mismatches when matching.
  5. This module will produce a file associating the read fragments to matched TaxIDs for 0 through N mismatches, where N is the mismatch threshold.
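
A naive illustration of the matching requirement above, counting mismatches directly against every window of each reference. A real implementation would rely on a pre-computed index rather than a linear scan; all names here are illustrative:

```python
def mismatches(fragment, window):
    """Hamming distance between a fragment and an equal-length window."""
    return sum(a != b for a, b in zip(fragment, window))

def survey(fragment, references, threshold):
    """Map a fragment to every TaxID whose reference contains a window
    within `threshold` mismatches; report the best mismatch count.
    `references` is an illustrative {taxid: sequence} dict."""
    hits = {}
    k = len(fragment)
    for taxid, seq in references.items():
        best = min(
            (mismatches(fragment, seq[i:i + k])
             for i in range(len(seq) - k + 1)),
            default=threshold + 1,
        )
        if best <= threshold:
            hits[taxid] = best
    return hits
```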
  4. Filter

  1. The module must receive the output specified in 3.5 as input.
  2. The module must receive the entire TaxID Tree.
  3. The module must be provided an output location for the files it creates.
  4. The module must identify informative read fragments from those provided by the output specified in 3.5, determined through application of the Lowest Common Ancestor (LCA) approach.
  4.1. If a read fragment hits on multiple references, a common ancestor is investigated.
  4.2. The user may supply a threshold for the maximum ancestry to be used in LCA.
  4.3. The module will traverse the edges of the TaxID Tree upward; if a common ancestor between the hit TaxIDs of a read fragment is found before the LCA threshold, then the read fragment must be considered informative.
  5. The module must use the informative reads to rank TaxIDs in each sample.
  6. The user may specify minimum thresholds of informativeness to filter TaxIDs. The most restrictive threshold must be used.
  6.1. The first threshold is the minimum observation frequency of informative reads for a TaxID in a sample.
  6.2. The second threshold is the minimum relative frequency of informative reads for a TaxID in a sample with respect to all informative hits.
  7. The module must union the identified TaxIDs of each sample and isolate the subtrees of the TaxID Tree resulting from rooting at the members of that union.
  8. The module must output, to a FASTA formatted file, the reference sequences associated with the TaxIDs resulting from the union of the members of each subtree isolated in 4.7.
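
The LCA informativeness test described above can be sketched as follows, assuming the TaxID Tree is available as a child-to-parent map. The representation and names are illustrative assumptions:

```python
def ancestors(taxid, parent):
    """Return taxid followed by its ancestors up to the root.
    `parent` is an illustrative {child_taxid: parent_taxid} map."""
    chain = [taxid]
    while taxid in parent:
        taxid = parent[taxid]
        chain.append(taxid)
    return chain

def is_informative(hit_taxids, parent, max_ancestry):
    """A fragment is informative if its hit TaxIDs share a common
    ancestor within `max_ancestry` upward steps of every hit."""
    if len(hit_taxids) == 1:
        return True
    # Truncate each ancestor chain at the threshold, then intersect.
    chains = [set(ancestors(t, parent)[: max_ancestry + 1]) for t in hit_taxids]
    return bool(set.intersection(*chains))
```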
  5. Refinement

  1. The Refinement module must perform all required functions outlined in the Survey (3) module, with the noted exception:
  1.1. Map Preparation: the module must receive the reduced reference set computed by the Filter (4) module instead of the full set of TaxID Reference Data.
  2. User parameters defined in Survey (3) must be independently replicated for this module.


  6. Aggregation

  1. The Aggregation module must perform all required functions outlined in the Filter (4) module, with the noted exception:
  2. This module must receive the hit data (defined in 3.5) resulting from Refinement (5) instead of the hit data of Survey (3).
  3. This module must create a Summary Stat file containing the total number of hits, hits by informative read fragments, and exclusive hits by informative read fragments for each sample.
  4. This module must create a Summary Table file which describes the number of times each identified TaxID was hit by an informative read fragment and the number of times it was hit exclusively by an informative read fragment.
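
The Summary Stat counts above can be illustrated with a small sketch. The record layout here is an assumption made purely for illustration:

```python
def summary_stats(hits):
    """`hits`: illustrative list of (fragment, sample, taxids, informative)
    records. Returns per-sample totals of hits, hits by informative
    fragments, and exclusive hits (informative fragments matching
    exactly one TaxID)."""
    stats = {}
    for fragment, sample, taxids, informative in hits:
        row = stats.setdefault(
            sample, {"hits": 0, "informative": 0, "exclusive": 0}
        )
        row["hits"] += len(taxids)
        if informative:
            row["informative"] += len(taxids)
            if len(taxids) == 1:
                row["exclusive"] += 1
    return stats
```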
  7. Alignment

  1. This module must receive the output specified in 2.6 as input.
  2. This module must take parameters from 5.2.
  3. This module must produce a FASTA formatted file containing the reference sequences indicated by the GIs associated with the TaxIDs resulting from the process described by 4.8.
  4. This module must align the read fragments of 7.1 to the reference sequences of 7.3, using the parameters of 7.2, to identify the positions of the reference sequences covered by the read fragments in each sample.
  5. This module must create an output file that describes the coverage of read fragments on reference sequences for each sample.
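
The per-reference coverage computation implied above can be sketched by merging aligned intervals. This is a simplification of what would be derived from Bowtie2 output:

```python
def coverage(alignments, ref_length):
    """`alignments`: (start, end) half-open intervals where read
    fragments aligned to one reference. Returns (covered_bases,
    fraction_of_reference_covered), counting overlapping bases once."""
    covered, last_end = 0, 0
    for start, end in sorted(alignments):
        if end > last_end:
            # Only count bases past the furthest position seen so far.
            covered += end - max(start, last_end)
            last_end = end
    return covered, covered / ref_length
```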


  8. Summary

  1. This module must receive the output specified in 7.5 as input.
  2. This module must receive the output specified in 6.4 as input.
  3. This module must receive the output specified in 6.3 as input.
  4. This module must be supplied a designated output directory.
  5. For each sample, this module must create a single-document summary output file.
  5.1. The summary file must contain a list of the TaxIDs identified in 6.4.
  5.2. The summary file must contain a list of GIs associated with the TaxIDs determined to be in the sample. These are the GIs which were hit by the reads in the sample.
  5.2.1. The user may choose to order this list by:
  5.2.1.1. Absolute total coverage: the total number of bases covered by one or more read fragments for a given GI.
  5.2.1.2. Relative coverage: the proportion of bases covered by the read fragments for a given GI, with respect to the total bases in the GI.
  5.2.1.3. Total number of hits to the GI. These include non-informative read fragments.
  5.2.1.4. Total number of informative hits to the GI.
  5.2.1.5. Total number of unique hits to the GI. These are read fragments which exclusively hit that GI.
  5.2.2. The user may choose to filter the number of top candidates reported.
  6. For each sample, this module must create coverage plots for each GI, graphically representing the coverage of reads in the sample on the sequences of the GIs associated with the TaxIDs found to be in the sample.
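
The ordering options above amount to sorting per-GI statistics on a chosen metric. A minimal sketch follows; the metric key names are assumptions made for illustration:

```python
def rank_gis(gi_stats, metric, top=None):
    """`gi_stats`: illustrative {gi: {"abs_cov": ..., "rel_cov": ...,
    "hits": ..., "informative": ..., "unique": ...}} mapping.
    Sort descending on the chosen metric and optionally keep only
    the top candidates."""
    ranked = sorted(gi_stats, key=lambda gi: gi_stats[gi][metric], reverse=True)
    return ranked[:top] if top else ranked
```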


5. Non-functional Requirements

  1. Global Requirements

  1. YAX must be modular. Modules must be replaceable with as little development work as possible.
  1.1. The state system is uncoupled from other YAX modules.
  1.2. Interfaces should exist between each module to ensure output from each module is in a standard format.
  1.3. YAX must adhere to the modular architecture outlined in Figure 1.
  2. YAX must provide user documentation completely describing the intended behaviour of:
  2.1. Each user parameter
  2.2. Each dependency
  2.3. Each input artifact
  2.4. Each module
  2.5. How to interact with YAX as a whole.
  3. There must be cross-validation test data sets to measure the precision and recall of the taxonomy assignment.
  4. YAX must be capable of determining how best to utilize resources allocated to it.
  4.1. YAX must be parallelized.
  4.2. YAX must accept a number of CPUs allocated to it.
  4.3. YAX must accept an amount of memory allocated to it.
  5. YAX must be capable of expanding a data set by processing only the additional reads, without re-computing previously processed data.
  6. YAX must run in a clustered environment utilizing a provided workload manager.
  6.1. YAX must run in parallel in this clustered environment to maximize efficiency.
  6.2. YAX must create and batch jobs in this environment using the workload manager.
  7. YAX must be written in Python.
  8. YAX must be reproducibly deployable.
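
Job batching through a workload manager might look like the following sketch, assuming SLURM as the provided manager (the requirements do not name one; the script name and values are illustrative):

```python
def sbatch_command(script, cpus, memory_gb, walltime):
    """Compose a SLURM submission command for one module job, using
    the CPU and memory allocations from the configuration file.
    SLURM and its flags are an assumption made for illustration."""
    return [
        "sbatch",
        f"--cpus-per-task={cpus}",
        f"--mem={memory_gb}G",
        f"--time={walltime}",
        script,
    ]
```

The command list could then be handed to `subprocess.run` on the cluster head node.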


6. Potential Risks

Ultimately the greatest risk is present in the final output of YAX. If this information is inaccurate, then incorrect or erroneous conclusions could be drawn. These conclusions could go on to impact a vast array of situations from public health to personal reputation. The most likely risk would be the identification of an organism that is not present in the samples. This kind of false positive could significantly alter any kind of published results. The worst case of this scenario could cause serious implications for the researchers' careers, as they would lose credibility and may have future funding opportunities adversely impacted. Any published works based on such erroneous data may also be subjected to retraction.

Alternatively, the most severe risk would be a false negative, where a species that is represented within the samples is not identified by YAX. If YAX failed to identify the presence of a potentially harmful organism, there could be serious public health consequences. This situation, however, is unlikely to occur, and is even less likely to cause serious problems, as YAX is not meant to be relied on for pathogen detection. Either way, the most fundamental way to avoid both false positives and false negatives is to use YAX as one of many identification sources in all critical health-related investigations.

Many researchers do not have access to local superclusters and may rent CPU-hours from cloud-based services such as AWS or Rackspace. Additionally, if institutional machines are available, there is still the opportunity cost associated with dedicating CPU-hours to any given task. As a result, it is important that YAX is effective in its use of wall-time. Should a runtime failure occur, it would be disastrous if the entire process needed to be restarted from scratch. Failing to create an effective system to “checkpoint” the current state would result in wasted dollars and/or opportunity when such a failure occurs. Most commonly we expect this to be the process running into hard limits on wall-time, which may be measured in weeks. Recomputing weeks of effort because the wall-time allowance was off by a day or two is simply not acceptable.

The impact on human resources that are potentially standing by is also not insignificant. YAX is very likely a single step in a chain of investigation. If it does not run in the most efficient manner possible, with accurate runtime failure recovery, resources waiting on its findings down the line could be blocked from progressing. This could have a snowballing effect, particularly if some step of a process, time in a sequencer for example, needs to be accurately scheduled ahead of time, and could result in significant monetary waste and missed opportunity.


7. Project Plan

  1. Requirements Draft
  2. Design Review Presentation
  3. Requirements Final
  4. Create testing data set and test cases
  5. Prototype state system
  6. Implement configuration initialization and reading system
  7. Implement ReadPrep module
  8. Implement Survey/Refinement module
  9. Implement Filter/Aggregation module
  10. Implement Alignment module
  11. Implement Summary module
  12. Develop Conda dependencies package
  13. Local Testing
  14. Cluster Testing

After the completion of this semester we will have a final requirements document in hand. Based on this, sets of test data will be created to provide smaller samples that can be used to quickly debug the software. The typical amount of data YAX will handle is much too large for initial testing purposes, so the creation of a smaller test data set is crucial.

Implementation will begin with the prototyping of the state system so that it can be used throughout implementation of the other modules. Because the state system is relevant through every step of the pipeline, further iteration on the state system will continue as modules are added to the pipeline. Implementation will continue in the order of the modules in the pipeline, with the exception of the Survey and Refinement modules, which are the same module with the same inputs, only of reduced size. The first time, the module is given the full precomputed taxonomy tree and TaxID reference set (Survey); the second time, the reduced TaxID reference set (Refinement). This module will be developed once, prior to the Filter module, and then placed in the workflow again after the Filter module. Similarly, Filter and Aggregation are the same module except for their output. Refinement of the reference set is required in either case.

Additionally, the Conda dependencies package will be developed alongside the modules to make continuous integration and local testing of the system possible. This will also help ensure that the system can be rolled out on various platforms without issue.


Figure 2. Gantt chart of project timeline.