PMC full text:
Published online 2021 Jul 28. doi: 10.1177/11779322211035921
Table 3.
Application | Tools | Description | Advantages | Limitations |
---|---|---|---|---|
Genomic sequencing analysis | Crossbow 28-30 | A pipeline for whole-genome re-sequencing analysis, combining Bowtie and SOAPsnp | Cost-effective, automatic, memory-efficient and ultrafast short-read aligner | Single-cluster implementation. Post-alignment bottleneck due to insufficient thread use during multithreading |
Programming model | Dryad 31 | A parallel processing framework that extends MapReduce for NGS data analysis. Runs on Hadoop YARN | Easy implementation over large data clusters | Works solely on DAGs, which makes developing new models challenging |
Short-read mapper | DistMap 32 | A scalable, modular, and unified workflow for mapping short reads from NGS data in the distributed Hadoop computing framework | Rapid parallel processing and accurate analysis using parallel graph algorithms | The 2-step input/output transfer requires a large amount of disk space |
Proteomic search engine | Hydra 33 | A scalable proteomic search engine for high-rate data generated from mass spectrometry. Runs on the Hadoop MapReduce framework | Uses the Hadoop infrastructure to manage parallel jobs, reducing infrastructure costs | Scalability issues as search rates increase with the growth of mass spectrometry proteomics |
Phylogenetic analysis | GATK 34 | A framework for large-scale next-generation DNA-sequencing analysis using MapReduce | Use of a robust common data management engine. Provision of automatic parallelization with efficient memory and CPU utilization. Applicable to both shared memory and distributed machines | Does not support additional data access patterns |
Sequence file management | Hadoop-BAM 35 | A novel, scalable distributed processing library that uses the Hadoop framework to manipulate large-scale aligned next-generation sequencing data | Uses the Picard SAM JDK. Provides an API for implementing MapReduce operations on BAM records; the Picard API readily supports large-scale distributed analysis | Command-line interface is not user-friendly and is limited in scope; nonexpert Hadoop users face difficulties |
Query engine | SeqWare 36 | Query engine used to load and query variants with a rich annotation standard, including coverage and functional consequences. Built on the NoSQL HBase database | Helps build automated workflows and processes for large-scale NGS analysis. Tracks analytical events by linking them to samples and studies | Poorly suited to analyzing a small number of NGS samples. Does not contain pre-built workflows for analyzing NGS data sets |
Phylogenetic analysis | MrsRF 37 | A scalable multicore algorithm that computes the Robinson-Foulds (RF) distance matrix between a large number (t) of trees using MapReduce for multicore phylogenetic applications | The MapReduce framework reduces the output size of the all-pairs RF distance (t × RF matrix), which is advantageous in computations involving phylogenetic trees | MrsRF does not incorporate communication cost |
Phylogenetic analysis | Nephele 38 | A tool suite that uses a composition vector algorithm for sequence comparison and affinity propagation clustering for grouping sequences into genotypes. Provides an advanced computing infrastructure for understanding the role of microbiota in human health through amplicon-based and whole metagenomic sequencing analysis | Cost-effective. All jobs in an analysis are reproducible. Tracks input files and VM images used in data analysis | Limited granular control of parameters and limited flexibility in output generation |
GPU-based software | GPU-BLAST 39 | A version of NCBI-BLAST that is 4 times faster | Capable of using both GPU and multiple-core CPU for parallel execution of comparisons of short and long sequences | High power consumption. Load balancing is required to gain higher performance with large clusters |
GPU-based software | SOAP3 40 | The first parallel short-read alignment tool used to improve speed and deployed on multi-processors in GPU | 2 to 10 times faster than widely adopted sequencing tools; achieves the highest sensitivity and low false discovery rates on reads of different lengths | Limited to identification of INDELs and small gaps; aligns reads with up to 4 mismatches |
Hadoop-based framework | Biodoop 41 | A Hadoop-based framework for the generation of large-scale virtual clusters for sequence alignment | Computational efficiency, scalability, and maintenance | Start-up overhead; post-processing of BLAST results and parallel computation of P values need improvement |
Large-scale sequencing | BioPig 42 | A novel sequence data analysis framework for bioinformatics applications using MapReduce and Pig Latin | Automated scalability with exponentially growing sequence data | Slow start-up of MapReduce jobs |
Feature-rich sequence processing | SeqPig 43 | Scalable and simple scripting for parallelizing large-scale sequencing tasks on distributed Hadoop, using the Apache Pig scripting language | Automatic scripting for parallelized data processing | Interactive jobs cannot be implemented because of MapReduce |
Workflow | Nextflow 44 | Open-source workflow framework used for scalable and integrative data-intensive bioinformatics computational pipelines | Software containers are used to enable consistency and reproducibility. Built-in support for HPC environments, Singularity, and Docker. Portable, fast prototyping, scalable, and stream oriented | Does not support the CWL specification, module, workflow compositions. There is no implementation of a graphical user interface to interact with the pipeline. Does not spawn the execution of pipeline tasks through a distributed cluster such as Apache Spark |
Workflow | Snakemake 45,46 | Designed for reproducible and scalable data analyses | Provides an execution environment that scales to server, cluster, grid, and cloud environments without modifying the workflow definition | Automatic translation of any CWL workflow definition into a Snakemake workflow is not yet implemented |
Parallel RNA-seq processing | Falco 47 | Cloud-based framework that enables parallelization of RNA-seq alignment, feature quantification, and quality control using the big data technologies Apache Hadoop and Apache Spark | Use of spot computing resources for analysis provides a ~65% reduction in the cost of analyzing data | Speed of splitting large files |
Abbreviations: API, Application Programming Interface; BAM, Binary Alignment/Map; CPU, Central Processing Unit; CWL, Common Workflow Language; DAG, Directed Acyclic Graph; GATK, Genome Analysis Toolkit; GPU, Graphics Processing Unit; GPU-BLAST, General-purpose graphics processing unit Basic Local Alignment Search Tool; HPC, high-performance computing; MrsRF, MapReduce Speeds up Robinson-Foulds; NCBI-BLAST, National Center for Biotechnology Information–Basic Local Alignment Search Tool; NGS, next-generation sequencing; RF, Robinson-Foulds; SOAP3, Short Oligonucleotide Alignment Program 3; VM, virtual machine.
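
As an illustration of the all-pairs Robinson-Foulds (RF) distance matrix mentioned in the MrsRF row, the following is a minimal single-process sketch in Python. It is not MrsRF's implementation and uses no MapReduce. It assumes each unrooted tree is already given as a set of its nontrivial bipartitions, with each bipartition recorded as the side that excludes a fixed reference taxon. The function names `rf_distance` and `rf_matrix` and the toy trees are hypothetical.

```python
from itertools import combinations


def rf_distance(splits_a, splits_b):
    """RF distance: number of bipartitions present in exactly one of the two trees."""
    return len(splits_a ^ splits_b)


def rf_matrix(trees):
    """Symmetric t x t matrix of RF distances for a list of t trees."""
    t = len(trees)
    matrix = [[0] * t for _ in range(t)]
    for i, j in combinations(range(t), 2):
        d = rf_distance(trees[i], trees[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical toy data: three unrooted trees on taxa A-E.
# Each split is stored as the side that does not contain taxon E.
tree1 = {frozenset("AB"), frozenset("ABC")}   # AB|CDE and ABC|DE
tree2 = {frozenset("AB"), frozenset("ABD")}   # AB|CDE and ABD|CE
tree3 = {frozenset("AC"), frozenset("ACD")}   # AC|BDE and ACD|BE

if __name__ == "__main__":
    for row in rf_matrix([tree1, tree2, tree3]):
        print(row)
```

Because the matrix is symmetric with a zero diagonal, only t(t − 1)/2 pairwise comparisons are needed. The pairs are independent of one another, which is what makes this kind of computation a candidate for splitting across workers.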