BMC Bioinformatics. 2013;14 Suppl 15(Suppl 15):S16.
doi: 10.1186/1471-2105-14-S15-S16. Epub 2013 Oct 15.

Finishing bacterial genome assemblies with Mix

Hayssam Soueidan et al. BMC Bioinformatics. 2013.

Abstract

Motivation: Among the challenges that hamper reaping the benefits of genome assembly are unfinished assemblies and the ensuing experimental costs. First, numerous software solutions for de novo genome assembly are available, each with its own advantages and drawbacks, but without clear guidelines on how to choose among them. Second, these solutions produce draft assemblies that often require a resource-intensive finishing phase.

Methods: In this paper we address these two aspects by developing Mix, a tool that mixes two or more draft assemblies without relying on a reference genome, with the goal of reducing contig fragmentation and thus speeding up genome finishing. The proposed algorithm builds an extension graph in which vertices represent extremities of contigs and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a set of paths in the extension graph that maximizes the cumulative contig length.
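
The idea of choosing a path that maximizes cumulative contig length can be illustrated with a deliberately simplified sketch. It is not the Mix implementation: it assumes one node per contig (rather than one per contig extremity), ignores orientation, returns a single path instead of a set of paths, and searches exhaustively, which is only practical for toy inputs. The function name `best_extension_path` and the input layout are hypothetical.

```python
from typing import Dict, List, Tuple


def best_extension_path(
    contig_len: Dict[str, int],
    overlaps: Dict[Tuple[str, str], int],
) -> Tuple[int, List[str]]:
    """Return (merged_length, path) for the simple path with the largest merged length.

    contig_len maps a contig name to its length.
    overlaps maps (a, b) to the length of an alignment between the end of
    contig a and the start of contig b (hypothetical, simplified encoding).
    """
    # Adjacency list: contig -> [(next contig, overlap length), ...]
    successors: Dict[str, List[Tuple[str, int]]] = {}
    for (a, b), overlap in overlaps.items():
        successors.setdefault(a, []).append((b, overlap))

    best: Tuple[int, List[str]] = (0, [])

    def dfs(node: str, visited: set, length: int, path: List[str]) -> None:
        nonlocal best
        if length > best[0]:
            best = (length, list(path))
        for nxt, overlap in successors.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                # Extending by nxt adds its length minus the aligned overlap.
                dfs(nxt, visited, length + contig_len[nxt] - overlap, path)
                path.pop()
                visited.remove(nxt)

    # Try every contig as a starting point.
    for start in contig_len:
        dfs(start, {start}, contig_len[start], [start])
    return best


if __name__ == "__main__":
    lengths = {"A": 1000, "B": 800, "C": 600, "D": 300}
    alignments = {("A", "B"): 100, ("B", "C"): 50, ("A", "D"): 20}
    print(best_extension_path(lengths, alignments))
```

In this toy input, chaining A, B and C gives 1000 + (800 − 100) + (600 − 50) = 2250 bases, which beats the alternative extension of A by D (1280 bases).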

Results: We evaluate the performance of Mix on bacterial NGS data from the GAGE-B study and apply it to newly sequenced Mycoplasma genomes. The resulting final assemblies demonstrate a significant improvement in overall assembly quality. In particular, Mix consistently provides better overall quality, even when the choice of assemblies to merge is guided solely by standard assembly statistics, as is the case for de novo projects.

Availability: Mix is implemented in Python and is available at https://github.com/cbib/MIX; novel data for our Mycoplasma study are available at http://services.cbib.u-bordeaux2.fr/mix/.

Figures

Figure 1
Possible extension alignments between Ci and Cj. Arrows stand for contigs' orientation, b and e stand for beginning and end coordinates of the alignment on each contig. Reverse cases are not depicted (i.e. where b and e positions are inverted).
Figure 2
Extension graph for two terminal alignments. Terminal alignments α (between contigs A and B) and γ (between contigs B and C) are each represented by eight nodes. Nodes encode the extremities of the alignment on each contig (border b and internal i extremities) and the direction in which it is read (forward or reverse). Edges encode the possible "glue" between contigs. Light gray edges represent a given alignment on the contig and carry no weight. Turquoise edges connect two contigs within an alignment and are labeled by its length (lα and lγ). Black edges connect to the In and Out nodes, allowing each contig to be read in both directions as well as allowing complex paths, and are labeled by the remaining contig length (lA, lB and lC). Note that the values of lB on the left-hand and right-hand sides of the figure differ because they depend on the alignment length: they are |B| − lα and |B| − lγ, respectively. Orange edges connect the extremities of different alignments in which one contig can participate: here α and γ for B. Their weights are deduced from the corresponding intervals (here |B| − lα − lγ for both).
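
The three weights that the caption gives for contig B can be checked with a small worked example. The function name, the dictionary keys and the numbers below are hypothetical and serve only to illustrate the arithmetic; the sketch also assumes that the two alignments do not overlap on B.

```python
from typing import Dict


def contig_b_edge_weights(len_b: int, l_alpha: int, l_gamma: int) -> Dict[str, int]:
    """Weights of the edges touching contig B, following the Figure 2 caption.

    len_b is |B|; l_alpha and l_gamma are the lengths of alignments
    alpha (A with B) and gamma (B with C).
    """
    if l_alpha + l_gamma > len_b:
        # Assumption of this sketch: the two alignments do not overlap on B.
        raise ValueError("alignments are longer than the contig")
    return {
        "black_alpha_side": len_b - l_alpha,        # |B| - l_alpha
        "black_gamma_side": len_b - l_gamma,        # |B| - l_gamma
        "orange": len_b - l_alpha - l_gamma,        # |B| - l_alpha - l_gamma
    }


if __name__ == "__main__":
    # Hypothetical values: |B| = 5000, l_alpha = 400, l_gamma = 250.
    print(contig_b_edge_weights(5000, 400, 250))
```

With these made-up values the black edges carry 4600 and 4750, and both orange edges carry 4350.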
Figure 3
Comparison of (A) NA50 and (B) duplication ratio measures for the GAGE-B benchmark. (A) For six bacterial genomes (six panels), eight assemblies were provided by GAGE-B and were either merged with GAA (64 combinations), GAM-NGS (64 combinations) or Mix (only 28 combinations, since Mix introduces no asymmetry between input assemblies), or not further processed (Single Assembly). The resulting assemblies were assessed against the reference genome with QUAST. The length of the shortest aligned contig among the longest ones that together cover 50% of the assembly (also known as NA50 or "corrected N50") is reported as a point for each possible combination of species, merger and assemblers (top panel); the higher the better. Box plots indicate the quartiles of the NA50 distribution. For each species and merger, the top five combinations of assemblies according to N50 were selected, and their NA50 values are depicted as large triangles. (B) Duplication ratio of the same assemblies; the horizontal dashed line indicates a perfect ratio of 1.
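
As an illustration of the statistic used in this caption, the following sketch computes N50 from a list of lengths. Applying the same function to the lengths of aligned blocks (contigs split at misassembly breakpoints) rather than to whole contigs gives an NA50-style value; that splitting step is done by QUAST and is not reproduced here. The helper `unordered_pairs` only shows that 28 is the number of unordered pairs of eight assemblies, which is consistent with the count reported for Mix. Both function names are hypothetical.

```python
from itertools import combinations
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length of the shortest sequence among the longest ones covering >= 50% of the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def unordered_pairs(n_assemblies: int) -> int:
    """Number of ways to pick two distinct assemblies when order does not matter."""
    return len(list(combinations(range(n_assemblies), 2)))


if __name__ == "__main__":
    print(n50([100, 80, 60, 40, 20]))
    print(unordered_pairs(8))
```

For the lengths 100, 80, 60, 40 and 20 (total 300), the two longest sequences sum to 180, which passes the 150 mark, so the function returns 80.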
Figure 4
Comparison of single and merged assemblies for Mycoplasma. For ten bacterial Mycoplasma genomes (ten columns), we generated three assemblies using CLC, MIRA and ABySS, which were subsequently either merged with GAA, GAM-NGS or Mix (28 combinations) or not further processed (Single Assembly). The resulting assemblies were assessed using standard statistics for genome assemblies (four rows): number of contigs, size of the largest contig, N50, and number of genes longer than 300 bp identified by the GeneMark gene finder. For the number of contigs, the lower the better. For the other three statistics, the higher the better.
Figure 5
Core genome conservation. For ten bacterial Mycoplasma genomes, assembled using CLC, MIRA and ABySS and then either left as is (single assemblies) or combined using GAA, GAM-NGS or Mix, we determined how much of a core genome defined over the whole genus Mycoplasma is preserved. The core genome is a set of 170 clusters of orthologous genes present in all strains. For each combination of species, single assembly and merger, this figure reports the distribution of the number of core-genome clusters for which at least one gene is found with 99% identity in the assembly.
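
The counting rule described in this caption can be sketched as follows. The input layout is hypothetical: each cluster maps to the percent identities of the best matches of its genes in one assembly, however those matches were obtained. The sketch also assumes that "with 99% identity" means an identity of at least 99%.

```python
from typing import Dict, List


def conserved_clusters(
    hits: Dict[str, List[float]],
    min_identity: float = 99.0,
) -> int:
    """Count clusters with at least one gene matched at or above min_identity.

    hits maps a cluster identifier to the percent identities of its genes'
    best matches in the assembly (empty list if no gene was found).
    """
    return sum(
        1
        for identities in hits.values()
        if any(identity >= min_identity for identity in identities)
    )


if __name__ == "__main__":
    # Made-up example with three clusters; only cluster_1 passes the threshold.
    example = {
        "cluster_1": [99.5, 80.0],
        "cluster_2": [98.9],
        "cluster_3": [],
    }
    print(conserved_clusters(example))
```

For a full core genome the result would range from 0 to 170, the number of clusters stated in the caption.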
