Bioinformatics applications on Apache Spark
- PMID: 30101283
- PMCID: PMC6113509
- DOI: 10.1093/gigascience/giy098
Abstract
With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of-the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset abstraction. In terms of performance, Spark can be up to 100 times faster than Hadoop when data are held in memory and 10 times faster when data are read from disk. Moreover, it provides high-level application programming interfaces in Java, Scala, Python, and R. It also includes several higher-level components: Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph computation, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey are used to provide comprehensive guidelines that allow bioinformatics researchers to apply Spark in their own fields.
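The resilient distributed dataset idea mentioned in the abstract can be illustrated with a small sketch. The code below is a toy model in plain Python, not Spark's API: the class ToyRDD, its methods, the kmers helper, and the example reads are all hypothetical names and data invented for illustration. It assumes the core idea is that each dataset remembers the transformation and parent it was derived from, so a lost partition can be recomputed from that lineage rather than restored from a replica.

```python
from collections import Counter


class ToyRDD:
    """Toy stand-in for a resilient distributed dataset (NOT Spark's API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # list of partitions, set only on a root dataset
        self._parent = parent  # lineage: the dataset this one was derived from
        self._fn = fn          # per-partition transformation applied to parent
        self._cache = {}       # materialized partitions (may be "lost")

    @classmethod
    def parallelize(cls, records, num_partitions):
        # Round-robin split of the records into partitions.
        parts = [records[i::num_partitions] for i in range(num_partitions)]
        return cls(source=parts)

    def num_partitions(self):
        if self._source is not None:
            return len(self._source)
        return self._parent.num_partitions()

    def flat_map(self, fn):
        # Lazy: only records the transformation; nothing is computed yet.
        return ToyRDD(
            parent=self,
            fn=lambda part: [out for rec in part for out in fn(rec)],
        )

    def compute(self, i):
        # Root partitions come from the source; derived partitions are
        # rebuilt from the parent whenever they are not materialized.
        if self._source is not None:
            return list(self._source[i])
        if i not in self._cache:
            self._cache[i] = self._fn(self._parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a worker failure by discarding a materialized partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


def kmers(read, k=3):
    """Return all overlapping substrings of length k in a read."""
    return [read[j:j + k] for j in range(len(read) - k + 1)]


# Hypothetical short reads, used only to exercise the sketch.
reads = ["ACGTAC", "GTACGT", "TTACGA", "ACGACG"]
kmer_rdd = ToyRDD.parallelize(reads, num_partitions=2).flat_map(kmers)

before = Counter(kmer_rdd.collect())
kmer_rdd.lose_partition(0)           # partition 0 is "lost"
after = Counter(kmer_rdd.collect())  # rebuilt from lineage on demand
```

In this sketch the k-mer counts are the same before and after the simulated failure, because the missing partition is recomputed from its parent. Real Spark additionally schedules partitions across cluster nodes and handles shuffles between them, which this toy model leaves out.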