Proc Natl Acad Sci U S A. 2022 Jan 4;119(1):e2113075119.
doi: 10.1073/pnas.2113075119.

AnchorWave: Sensitive alignment of genomes with high sequence diversity, extensive structural polymorphism, and whole-genome duplication


Baoxing Song et al. Proc Natl Acad Sci U S A.

Abstract

Millions of species are currently being sequenced, and their genomes are being compared. Many of them have more complex genomes than model systems and raise novel challenges for genome alignment. Widely used local alignment strategies often produce limited or incongruous results when applied to genomes with dispersed repeats, long indels, and highly diverse sequences. Moreover, alignment using many-to-many or reciprocal best hit approaches conflicts with well-studied patterns between species with different rounds of whole-genome duplication. Here, we introduce Anchored Wavefront alignment (AnchorWave), which performs whole-genome duplication-informed collinear anchor identification between genomes and performs base pair-resolved global alignment for collinear blocks using a two-piece affine gap cost strategy. This strategy enables AnchorWave to precisely identify multikilobase indels generated by transposable element (TE) presence/absence variants (PAVs). When aligning two maize genomes, AnchorWave successfully recalled 87% of previously reported TE PAVs. By contrast, other genome alignment tools showed low power for TE PAV recall. AnchorWave precisely aligns up to three times more of the genome as position matches or indels than the closest competitive approach when comparing diverse genomes. Moreover, AnchorWave recalls transcription factor-binding sites at a rate of 1.05- to 74.85-fold higher than other tools with significantly lower false-positive alignments. AnchorWave complements available genome alignment tools by showing obvious improvement when applied to genomes with dispersed repeats, active TEs, high sequence diversity, and whole-genome duplication variation.
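The two-piece affine gap cost mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which a gap of length k costs the smaller of two affine functions, one cheap to open but costly to extend and one costly to open but cheap to extend. The function name and all four penalty values below are hypothetical placeholders, not AnchorWave's defaults.

```python
def two_piece_gap_cost(length, open1=4, extend1=2, open2=80, extend2=1):
    """Cost of a gap of `length` bases under a two-piece affine model.

    The penalty values are illustrative assumptions, not AnchorWave defaults.
    """
    if length <= 0:
        return 0
    # Steep piece: cheap to open, expensive to extend (favours short indels).
    short_piece = open1 + length * extend1
    # Shallow piece: expensive to open, cheap to extend (favours long indels).
    long_piece = open2 + length * extend2
    return min(short_piece, long_piece)


if __name__ == "__main__":
    for n in (1, 10, 1000, 10000):
        print(n, two_piece_gap_cost(n))
```

With these placeholder values the two pieces cross at 76 bases. Beyond that point each extra gap base costs 1 rather than 2, so a multikilobase TE insertion is penalized far less than it would be under a single affine model tuned for short indels.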

Keywords: genome comparison; regulatory element alignment; sensitive genome alignment; transposable element variation; whole-genome duplication.

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
Principle of the AnchorWave process. AnchorWave identifies collinear regions via conserved anchors (here, full-length CDS) and breaks collinear regions into shorter fragments (i.e., anchor and interanchor intervals). By merging shorter intervals together after performing sensitive sequence alignment via a two-piece affine gap cost global sequence alignment strategy, AnchorWave generates a whole-genome alignment. AnchorWave implements commands to guide collinear block identification with or without chromosomal rearrangements and provides options to use known WGD to inform the alignment.
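As a rough illustration of what "collinear" means for anchors, the sketch below keeps the largest set of anchors whose order agrees in both genomes, using a longest-increasing-subsequence search. This is a deliberately simplified stand-in, not AnchorWave's procedure: it ignores strand, inversions, rearrangements, and WGD copy number, and it assumes reference start positions are distinct. The function name and the anchor representation as (ref_start, query_start) tuples are hypothetical.

```python
from bisect import bisect_left


def longest_collinear_chain(anchors):
    """Return the largest subset of (ref_start, query_start) anchors that
    appear in the same order in both genomes (forward strand only)."""
    ordered = sorted(anchors)      # order anchors along the reference
    tails = []                     # tails[k]: smallest query_start ending a chain of length k + 1
    tails_idx = []                 # index in `ordered` of that chain end
    prev = [-1] * len(ordered)     # back-pointers used to rebuild the chain
    for i, (_, query_start) in enumerate(ordered):
        k = bisect_left(tails, query_start)
        if k == len(tails):
            tails.append(query_start)
            tails_idx.append(i)
        else:
            tails[k] = query_start
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


if __name__ == "__main__":
    toy_anchors = [(100, 1000), (200, 5000), (300, 1100),
                   (400, 1200), (500, 900), (600, 1300)]
    print(longest_collinear_chain(toy_anchors))
```

In the toy input, the anchors at query positions 5000 and 900 break the shared order and are dropped. The intervals between consecutive retained anchors are the interanchor fragments that would then be aligned base by base.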
Fig. 2.
Comparison of genome alignment tools using genomes of different individuals in the same species. (A) Comparison of the performance of genome alignment tools at variant sites of 18 Arabidopsis accession alignment benchmarks. Genome alignments were performed using minimap2 with preset options asm5, asm10, and asm20, which terminate extension in regions with 5, 10, and 20% sequence divergence, respectively. GSAlign and MUMmer4 alignments were performed with default parameters. LAST genome alignments with default parameters were termed as LAST many-to-many. LAST many-to-many alignments were processed with a chain and net procedure to generate LAST many-to-one alignments (each query genome nucleotide may be aligned multiple times, while each reference nucleotide can be aligned up to one time). LAST many-to-one alignments were filtered to generate LAST one-to-one alignments (SI Appendix, Supporting Note 1). (B) Recall and precision of TE deletions by aligning the TE-removed maize B73 genome against the reference genome. MUMmer4, GSAlign, and LAST one-to-one had zero recall ratio. (C) Overview of the maize B73 genome sites aligned to the maize Mo17 genome. In those TE regions which were previously reported as present in B73 and absent in Mo17 (TE PAV on the legend), no position match alignments are expected. A higher number of position matches in these regions (striped orange) indicates a higher false-positive ratio. (D) Two inversions were located using AnchorWave between the maize B73-Ab10 assembly and the B73 v4 reference genome.
Fig. 3.
A comparison of different genome alignment tools for aligning the maize B73 v4 and sorghum genomes. (A) The identified collinear anchors between the maize B73 v4 assembly and sorghum genome on chr4 and chr5. Each dot is plotted based on the start coordinate of the reference genome and query genome of each anchor. Collinear anchors on the same strand between the reference genome and query genome are shown in blue, otherwise red. (B) Cumulative distribution of TE length versus TE age in the B73 v4 maize genome with age measured in millions of years (My). The dashed line indicates 12 My, the estimated divergence time of maize and sorghum (34). TE age data are from Stitzer et al. (39); 371 TEs older than 20 My were not plotted, and the total length of these 371 TEs is 531 Kbp. (C) Sequence alignment between the maize B73 v4 genome and the sorghum genome. Minimap2, MUMmer4, and GSAlign generated many-to-many alignments. Since most maize TEs are not shared with sorghum, a higher number of position matches in maize TE regions (striped orange) indicates a higher false-positive ratio. AnchorWave aligns 88.7% of the maize genome to the sorghum genome, while the second highest is 28.0% generated by LAST many-to-many. (D) Comparison of the proportion of sites in maize TFBS that were aligned as a position match (recall) and the position match ratios (number of position match sites to number of aligned sites) in TFBS versus non-TFBS regions.
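The two quantities defined in Fig. 3D can be written out as a short sketch. Recall is the share of sites in a region (for example TFBS) that are aligned as a position match, and the position match ratio is the number of position match sites divided by the number of aligned sites. Representing sites as Python sets of reference coordinates, and the function and argument names, are assumptions made for illustration only.

```python
def region_alignment_metrics(region_sites, match_sites, aligned_sites):
    """Return (recall, position_match_ratio) for one region.

    region_sites:  reference positions inside the region of interest
    match_sites:   reference positions aligned as a position match
    aligned_sites: reference positions covered by any alignment
    """
    matched_in_region = region_sites & match_sites
    aligned_in_region = region_sites & aligned_sites
    recall = len(matched_in_region) / len(region_sites) if region_sites else 0.0
    ratio = (len(matched_in_region) / len(aligned_in_region)
             if aligned_in_region else 0.0)
    return recall, ratio


if __name__ == "__main__":
    tfbs = set(range(100, 110))                # 10 toy TFBS sites
    matches = set(range(100, 106)) | {200}     # 6 matches fall inside the TFBS
    aligned = set(range(100, 108)) | {200, 201}
    print(region_alignment_metrics(tfbs, matches, aligned))
```

Running the same function on the TE PAV regions of Fig. 2C or the maize TE regions of Fig. 3C would give the share of position matches where few or none are expected, which the captions treat as a proxy for false-positive alignment.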

References

    1. Lewin H. A., et al., Earth BioGenome project: Sequencing life for the future of life. Proc. Natl. Acad. Sci. U.S.A. 115, 4325–4333 (2018).
    2. Exposito-Alonso M., Drost H.-G., Burbano H. A., Weigel D., The Earth BioGenome project: Opportunities and challenges for plant genomics and conservation. Plant J. 102, 222–229 (2020).
    3. Wei B., et al., Genome-wide characterization of non-reference transposons in crops suggests non-random insertion. BMC Genomics 17, 536 (2016).
    4. Lu Z., et al., The prevalence, evolution and chromatin signatures of plant regulatory elements. Nat. Plants 5, 1250–1259 (2019).
    5. Freeling M., Scanlon M. J., Fowler J. E., Fractionation and subfunctionalization following genome duplications: Mechanisms that drive gene content and their consequences. Curr. Opin. Genet. Dev. 35, 110–118 (2015).
