Review. Gigascience. 2018 Aug 1;7(8):giy098. doi: 10.1093/gigascience/giy098.

Bioinformatics applications on Apache Spark

Runxin Guo et al.

Abstract

With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset (RDD) abstraction. In terms of performance, Spark can be up to 100 times faster than Hadoop for in-memory processing and up to 10 times faster for on-disk processing. Moreover, it provides advanced application programming interfaces in Java, Scala, Python, and R. It also supports several advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey provide a comprehensive guide that allows bioinformatics researchers to apply Spark in their own fields.


Figures

Figure 1: The cluster architecture of Spark.

Figure 2: Examples of narrow and wide dependencies. Each box is an RDD, with partitions shown as shaded rectangles.

Figure 3: An example of how Spark computes job stages. Boxes with solid outlines are RDDs. Partitions are shaded rectangles and are black if they are already in memory. To run an action on RDD G, we build stages at wide dependencies and pipeline narrow transformations inside each stage. In this case, the output RDD of stage 1 is already in memory, so we run stage 2 and then stage 3.
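The stage-building rule described in the Figure 3 caption can be mimicked in plain Python (a conceptual sketch only, with made-up data; real Spark schedules stages over distributed partitions): narrow transformations are fused and applied partition by partition with no data movement, while a wide dependency such as `reduceByKey` forces a shuffle that starts a new stage.

```python
from collections import defaultdict

# An "RDD" with two partitions of toy reads (hypothetical data).
partitions = [["GATTACA"], ["TAGACAT"]]

# Stage 1: pipeline the narrow ops flatMap(list) and map(b -> (b, 1))
# within each partition, with no communication between partitions.
stage1 = [[(base, 1) for read in part for base in read] for part in partitions]

# Shuffle at the stage boundary: hash-partition records by key.
shuffled = defaultdict(list)
for part in stage1:
    for key, value in part:
        shuffled[key].append(value)

# Stage 2: reduceByKey over the shuffled data.
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)  # {'G': 2, 'A': 6, 'T': 4, 'C': 2}
```

This is why minimizing wide dependencies matters for performance: each shuffle materializes intermediate data and synchronizes all partitions, whereas pipelined narrow transformations stay local to a partition.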

