Bioinformatics and Biology Insights

Table 3.

Summary of big data tools for genotype and other omics analysis.

| Application | Tools | Description | Advantages | Limitations |
|---|---|---|---|---|
| Genomic sequencing analysis | Crossbow 28-30 | A pipeline for whole-genome resequencing analysis that combines Bowtie and SOAPsnp | Cost-effective, automatic, memory-efficient, and ultrafast short-read aligner | Single-cluster implementation. Postalignment bottleneck due to insufficient thread use during multithreading |
| Programming model | Dryad 31 | A parallel processing framework that extends MapReduce for NGS data analysis. Runs on Hadoop YARN | Easy implementation over large data clusters | Works solely on DAGs, which makes the development of new models challenging |
| Short-read mapper | DistMap 32 | A scalable, modular, and unified workflow for mapping short reads from NGS data in the distributed Hadoop computing framework | Rapid parallel processing and accurate analysis using parallel graph algorithms | The 2-step input/output transfer requires a huge amount of disk space |
| Proteomic search engine | Hydra 33 | A scalable proteomic search engine for high-rate data generated from mass spectrometry. Runs on the Hadoop MapReduce framework | Uses the Hadoop infrastructure to manage parallel jobs while reducing infrastructure costs | Scalability issues as search rates increase with the growth of mass spectrometry proteomics |
| Phylogenetic analysis | GATK 34 | A framework for large-scale next-generation DNA-sequencing analysis using MapReduce | Uses a robust common data management engine. Provides automatic parallelization with efficient memory and CPU utilization. Applicable to both shared-memory and distributed machines | Does not support additional data access patterns |
| Sequence file management | Hadoop-BAM 35 | A novel, scalable distributed processing library that uses the Hadoop framework to manipulate large-scale aligned next-generation sequencing data | Uses the Picard SAM JDK API to implement MapReduce operations on BAM records; the Picard API easily supports large-scale distributed analysis | Uses a command-line interface, which is not user-friendly and is limited in scope; nonexpert Hadoop users face difficulties |
| Query engine | SeqWare 36 | Query engine used to load and query variants with a rich annotation standard, including coverage and functional consequences. Built on the NoSQL HBase database | Helps build automated workflows and processes for large-scale NGS analysis. SeqWare tracks analytical events by linking them to samples and studies | Not well suited to analyzing a small number of NGS samples. SeqWare does not contain prebuilt workflows to analyze NGS data sets |
| Phylogenetic analysis | MrsRF 37 | A scalable multicore algorithm that computes the Robinson-Foulds (RF) distance matrix between a large number (t) of trees using MapReduce for multicore phylogenetic applications | The MapReduce framework reduces the output size of the all-pairs RF distance (t × t RF matrix) and is therefore advantageous in computations involving phylogenetic trees | MrsRF does not incorporate communication cost |
| Phylogenetic analysis | Nephele 38 | A tool suite that uses a composition vector algorithm for sequence comparison and affinity propagation clustering for grouping sequences into genotypes. Provides an advanced computing infrastructure for understanding the role of microbiota in human health through amplicon-based and whole metagenomic sequencing analysis | Cost-effective. All jobs in an analysis are reproducible. Tracks input files and VM images used in data analysis | Limited granular control of parameters and limited flexibility in output generation |
| GPU-based software | GPU-BLAST 39 | A 4 times faster version of NCBI-BLAST | Capable of using both GPU and multicore CPU for parallel execution of comparisons of short and long sequences | High power consumption. Load balancing is required to gain higher performance with large clusters |
| GPU-based software | SOAP3 40 | The first parallel short-read alignment tool, designed to improve speed and deployed on GPU multiprocessors | 2 to 10 times faster than widely adopted sequencing tools; achieves the highest sensitivity and low false discovery rates on sequence reads of different lengths | Limited to identification of INDELs and small gaps; aligns reads with up to 4 mismatches |
| Hadoop-based framework | Biodoop 41 | A Hadoop-based framework for the generation of large-scale virtual clusters for sequence alignment | Computational efficiency, scalability, and maintenance | Start-up overhead; postprocessing of BLAST results and parallel computation of P values need improvement |
| Large-scale sequencing | BioPig 42 | A novel sequence data analysis framework for bioinformatics applications using MapReduce and Pig Latin | Automated scalability with exponentially growing sequence data | Slow start-up of MapReduce jobs |
| Feature-rich sequence processing | SeqPig 43 | Scalable and simple scripting for parallelizing large-scale sequencing tasks on distributed Hadoop, using the Apache Pig scripting language | Automatic scripting for parallelized data processing | Implementing interactive jobs is impossible due to MapReduce |
| Workflow | Nextflow 44 | Open-source workflow framework used for scalable and integrative data-intensive bioinformatics computational pipelines | Software containers are used to enable consistency and reproducibility. Built-in support for HPC environments, Singularity, and Docker. Portable, fast prototyping, scalable, and stream oriented | Does not support the CWL specification, modules, or workflow composition. There is no graphical user interface for interacting with the pipeline. Does not spawn the execution of pipeline tasks through a distributed cluster such as Apache Spark |
| Workflow | Snakemake 45,46 | Designed for reproducible and scalable data analyses | Provides an execution environment that scales to server, cluster, grid, and cloud environments without modifying the workflow definition | Automatic translation of any CWL workflow definition into a Snakemake workflow is not yet implemented |
| Parallel RNA-seq processing | Falco 47 | Cloud-based framework to enable parallelization, RNA-seq alignment/feature quantification, and quality control using the big data technologies Apache Hadoop and Apache Spark | Usage of spot computing resources for analysis provides a ~65% reduction in the cost of analyzing data | Speed of splitting large files |

Abbreviations: API, application programming interface; BAM, Binary Alignment/Map; CPU, central processing unit; CWL, Common Workflow Language; DAG, directed acyclic graph; GATK, Genome Analysis Toolkit; GPU, graphics processing unit; GPU-BLAST, General-purpose graphics processing unit Basic Local Alignment Search Tool; HPC, high-performance computing; MrsRF, MapReduce Speeds up Robinson-Foulds; NCBI-BLAST, National Center for Biotechnology Information Basic Local Alignment Search Tool; NGS, next-generation sequencing; RF, Robinson-Foulds; SOAP3, Short Oligonucleotide Alignment Program 3; VM, virtual machine.
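
To make the MrsRF entry in Table 3 more concrete, the following is a minimal single-process Python sketch of how an all-pairs Robinson-Foulds (RF) distance matrix over t trees can be phrased as a map step and a reduce step. It is an illustration under stated assumptions, not the MrsRF implementation: trees are assumed to be nested tuples of leaf names over the same leaf set, the RF distance is taken as the size of the symmetric difference of the two trees' non-trivial splits, and all function names (`leaves`, `bipartitions`, `map_phase`, `reduce_phase`, `rf_matrix`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial splits of a tree, read as unrooted, in canonical form."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick one side of each split
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        # Skip trivial splits: a single leaf, all but one leaf, or every leaf.
        if 1 < len(side) < len(all_leaves) - 1:
            # Canonical form: store the side that does not contain the anchor.
            splits.add(side if anchor not in side else all_leaves - side)
        for child in node:
            walk(child)

    walk(tree)
    return splits


def map_phase(trees):
    """Map step: emit one (split, tree_id) pair for every split of every tree."""
    for tree_id, tree in enumerate(trees):
        for split in bipartitions(tree):
            yield split, tree_id


def reduce_phase(pairs):
    """Reduce step: group tree ids by split, then count shared splits per tree pair."""
    by_split = defaultdict(list)
    for split, tree_id in pairs:
        by_split[split].append(tree_id)
    shared = defaultdict(int)
    for tree_ids in by_split.values():
        for i, j in combinations(sorted(tree_ids), 2):
            shared[(i, j)] += 1
    return shared


def rf_matrix(trees):
    """Return the t x t matrix of RF distances (symmetric-difference counts)."""
    n_splits = [len(bipartitions(tree)) for tree in trees]
    shared = reduce_phase(map_phase(trees))
    t = len(trees)
    matrix = [[0] * t for _ in range(t)]
    for i, j in combinations(range(t), 2):
        distance = n_splits[i] + n_splits[j] - 2 * shared[(i, j)]
        matrix[i][j] = matrix[j][i] = distance
    return matrix


if __name__ == "__main__":
    t1 = (("A", "B"), ("C", "D"))  # split AB / CD
    t2 = (("A", "C"), ("B", "D"))  # split AC / BD
    print(rf_matrix([t1, t2, t1]))  # [[0, 2, 0], [2, 0, 2], [0, 2, 0]]
```

Grouping by split before comparing trees means each tree is traversed only once, and the pairwise work is limited to trees that actually share a split. In a distributed setting the map and reduce functions would run on separate workers; this sketch keeps everything in one process and, like the limitation noted in the table, ignores communication cost.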
