BMC Bioinformatics. 2021 Nov 27;22(1):570.
doi: 10.1186/s12859-021-04480-2.

EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality

Madolyn L MacDonald et al. BMC Bioinformatics. 2021.

Abstract

Background: To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment.

Results: EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model trained with supervised machine learning to integrate these metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and explained 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA with two other quality evaluation tools. In addition, EvalDNA was used to evaluate several assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study, completed in 2018, and its ability to score bacterial genomes was examined by applying it to bacterial assemblies from the GAGE-B study.
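
As a rough illustration of how an averaged ensemble of regression trees can combine several metrics into one score, the sketch below trains a deliberately simplified forest (one-split trees fitted on bootstrap samples) using only the Python standard library. It is not the EvalDNA model, which the abstract describes as a random forest regression model shipped with the tool. The metric columns, training values, and function names are all hypothetical.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Best single split on one feature, chosen by squared error."""
    best = None
    values = sorted({r[feature] for r in rows})
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= thr]
        right = [t for r, t in zip(rows, targets) if r[feature] > thr]
        lm, rm = mean(left), mean(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, feature, thr, lm, rm)
    return best  # None when the feature is constant in this sample


def fit_forest(rows, targets, n_trees=200, seed=0):
    """Bagged one-split trees, each using one randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    while len(forest) < n_trees:
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)
        stump = fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature)
        if stump is not None:
            forest.append(stump[1:])
    return forest


def predict(forest, row):
    """Average the predictions of all trees."""
    return mean(lm if row[f] <= thr else rm for f, thr, lm, rm in forest)


# Hypothetical training data: [contiguity, completeness, error rate] per sequence,
# with made-up reference-based quality scores as targets.
rows = [
    [0.10, 0.60, 0.30],
    [0.20, 0.70, 0.25],
    [0.40, 0.80, 0.15],
    [0.60, 0.90, 0.10],
    [0.80, 0.95, 0.05],
    [0.90, 0.99, 0.01],
]
targets = [20, 35, 55, 70, 85, 95]

forest = fit_forest(rows, targets)
low_score = predict(forest, rows[0])
high_score = predict(forest, rows[-1])
```

On this toy data the sequence with the best metrics receives a higher estimated score than the one with the worst. A real random forest grows deeper trees and would normally be fitted with an established library.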

Conclusions: EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species.

Keywords: CHO cells; Chinese hamster; Genome assembly; Genome assembly quality; Machine learning.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Computational workflow of EvalDNA. EvalDNA requires the assembly of interest in FASTA format, a configuration file, and Illumina paired read data in either FASTQ or BAM file format. EvalDNA first calculates contiguity and completeness metrics, and then calculates accuracy metrics based on the output from running REAPR and SAMtools. This part of EvalDNA produces a list of metrics that will be given to the scoring model (written in R) to estimate the overall quality score for the assembly. The red arrows signify the sequence of steps EvalDNA goes through to calculate various metrics, while the gray arrows signify input and output of data
Fig. 2
Pearson correlation among all metrics. Cells with an X denote metrics with insignificant correlation. Dark blue represents a stronger positive correlation, while dark red represents a stronger negative correlation
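
A pairwise Pearson correlation table of the kind summarized in this figure can be sketched as follows. The metric names and values are invented for illustration. The t statistic shown is the standard one for testing a correlation against zero, and whether the authors used this particular significance test is an assumption.

```python
from math import inf, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1:
        return inf
    return abs(r) * sqrt((n - 2) / (1 - r * r))


def correlation_matrix(metrics):
    """Nested dict of Pearson correlations for every pair of metrics."""
    names = list(metrics)
    return {a: {b: pearson(metrics[a], metrics[b]) for b in names} for a in names}


# Hypothetical metric values for five assembled sequences.
metrics = {
    "metric_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "metric_b": [5.0, 4.0, 3.0, 2.0, 1.0],
    "metric_c": [2.0, 1.0, 4.0, 3.0, 5.0],
}
matrix = correlation_matrix(metrics)
```

A pair would be marked as not significantly correlated when its t statistic falls below the critical value of the t distribution for the chosen significance level.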
Fig. 3
Performance of the random forest regression model on test data. Estimated quality scores for the test instances are plotted against the reference-based quality scores of the test instances. A 100% accurate model would produce the blue line with an r-squared equal to 1. The line of best fit for the plotted data is shown as the red line and has an r-squared of 0.8597
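
The r-squared of a line of best fit, as reported in this caption, can be computed as in the sketch below: fit a least-squares line of estimated scores on reference-based scores, then compare the residual variation with the total variation. The score values are invented, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """1 - SS_res / SS_tot for the least-squares line of y on x."""
    slope, intercept = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Hypothetical reference-based scores (x) and model estimates (y).
reference = [10.0, 20.0, 30.0, 40.0, 50.0]
estimated = [12.0, 18.0, 33.0, 39.0, 52.0]

r2_model = r_squared(reference, estimated)
r2_perfect = r_squared(reference, reference)  # estimates identical to reference
```

When every estimate equals its reference score, the points lie on the identity line and the value is 1. Scatter around the fitted line lowers it.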
Fig. 4
Comparison of quality evaluation methods on human chromosome 14 assemblies (from the GAGE study). A The EvalDNA ranking of assemblers used to build the human chromosome 14 assembly is compared with the rankings from ALE and FRCbam. The highest quality assembly is given a rank of 1. B EvalDNA and ALE scores for the human chromosome 14 assemblies were normalized (scaled to the range [0, 1]). ALE paper scores were calculated using the same parameters and version of Bowtie described in Clark et al. The ALE redone scores were calculated with an updated version of Bowtie. FRCbam normalized scores were derived using the x-value (feature threshold) where the y-value (percent approximate coverage) was maximum for each FRCbam curve
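
Scaling scores from different tools to [0, 1] and turning them into ranks can be sketched as below. Min-max scaling is assumed here as the normalization, the assembly names and score values are placeholders, and higher scores are assumed to mean higher quality.

```python
def min_max_scale(scores):
    """Rescale a name -> score mapping so values span [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (s - lo) / (hi - lo) for name, s in scores.items()}


def rank_descending(scores):
    """Rank 1 goes to the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


# Placeholder scores on two very different scales.
tool_one = {"assembly_a": 72.0, "assembly_b": 55.0, "assembly_c": 64.0}
tool_two = {"assembly_a": -1.2e8, "assembly_b": -3.5e8, "assembly_c": -0.9e8}

scaled_one = min_max_scale(tool_one)
scaled_two = min_max_scale(tool_two)
ranks_one = rank_descending(tool_one)
ranks_two = rank_descending(tool_two)
```

After scaling, both tools' scores can be plotted on one axis, while the ranks show where the tools disagree on ordering.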
Fig. 5
Comparison of quality evaluation methods on Chinese hamster (CH) genome assemblies. A Comparison of the EvalDNA ranking of the CH genome assemblies with a manual ranking and with rankings from ALE and FRCbam. The highest quality assembly is given a rank of 1. B EvalDNA and ALE scores for the CH assemblies as well as the rankings from Rupp et al. and FRCbam were normalized (scaled to the range [0, 1]). FRCbam normalized scores were derived using the x-value (feature threshold) where the y-value (percent approximate coverage) was maximum for each FRCbam curve
Fig. 6
EvalDNA quality scores for chromosomes from various genome assemblies. A EvalDNA quality scores for chromosomes from CH PICR, CH 2013 RefSeq, and the mouse, rat, human, cow, and rice reference genome assemblies. B EvalDNA quality scores for the same chromosomes but calculated using a model that does not include the normalized N50 metric
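
For readers unfamiliar with the N50 metric named in this caption, the sketch below computes the standard N50 of a set of contig lengths. The normalization shown (dividing by the total assembled length) is only an assumption, since the abstract does not state how EvalDNA normalizes N50. The contig lengths are made up.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequence lengths given")


def normalized_n50(lengths):
    """Assumed normalization: N50 as a fraction of the total assembled length."""
    return n50(lengths) / sum(lengths)


# Made-up contig lengths for one assembled chromosome.
contigs = [80, 70, 50, 40, 30, 20]
raw = n50(contigs)
fraction = normalized_n50(contigs)
```

Here the two longest contigs (80 and 70) together cover more than half of the 290 total bases, so the N50 is 70. Expressing it as a fraction of total length is one way to make the value less dependent on sequence size.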
Fig. 7
Impact of error rates on the EvalDNA quality scores of CH PICR chromosomes. A Changes in EvalDNA quality scores due to simulation errors. B Changes in scaled EvalDNA quality scores due to simulation errors. Scores were scaled so that the maximum score for a chromosome became 100
Fig. 8
Impact of error rates on the EvalDNA quality scores of CH PICR scaffolds. A Changes in EvalDNA quality scores due to simulation errors. B Changes in scaled EvalDNA quality scores due to simulation errors. Scores were scaled so that the maximum score for each scaffold became 100
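
The scaling described in panel B of Figs. 7 and 8, where each sequence's maximum score becomes 100, can be sketched as follows. The sequence names and score values are invented, and positive scores are assumed.

```python
def scale_to_max_100(scores):
    """Rescale one sequence's scores so that its maximum becomes 100."""
    top = max(scores)
    return [100 * s / top for s in scores]


# Invented scores for two sequences at increasing simulated error rates.
scores_by_sequence = {
    "sequence_1": [82.0, 79.5, 71.0, 60.2],
    "sequence_2": [64.0, 63.1, 58.4, 49.9],
}
scaled = {name: scale_to_max_100(s) for name, s in scores_by_sequence.items()}
```

Scaling each sequence separately puts sequences with different starting scores on a common footing, so the relative drop caused by added errors can be compared directly.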

References

    1. Kitts PA, Church DM, Thibaud-Nissen F, Choi J, Hem V, Sapojnikov V, et al. Assembly: a resource for assembled genomes at NCBI. Nucleic Acids Res. 2016;44(D1):D73–80. doi: 10.1093/nar/gkv1226.
    2. Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 2008;9(3):R55. doi: 10.1186/gb-2008-9-3-r55.
    3. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–1075. doi: 10.1093/bioinformatics/btt086.
    4. Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012;22(3):557–567. doi: 10.1101/gr.131383.111.
    5. Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics. 2018;34(13):i142–i150. doi: 10.1093/bioinformatics/bty266.
