Automated de novo identification of repeat sequence families in sequenced genomes

doi:10.1101/gr.88502

Genome Res. 2002 Aug;12(8):1269-76.

Zhirong Bao¹, Sean R Eddy

PMID: 12176934
PMCID: PMC186642
DOI: 10.1101/gr.88502

Zhirong Bao et al. Genome Res. 2002 Aug.

Affiliation

¹ Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA.

Abstract

Repetitive sequences make up a major part of eukaryotic genomes. We have developed an approach for the de novo identification and classification of repeat sequence families that is based on extensions to the usual approach of single linkage clustering of local pairwise alignments between genomic sequences. Our extensions use multiple alignment information to define the boundaries of individual copies of the repeats and to distinguish homologous but distinct repeat element families. When tested on the human genome, our approach was able to properly identify and group known transposable elements. The program, RECON, should be useful for first-pass automatic classification of repeats in newly sequenced genomes.
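
For readers unfamiliar with the baseline the abstract refers to, single linkage clustering can be sketched as finding connected components in a graph whose edges are pairwise alignments. The following is a minimal illustration only, not the RECON code; the function name `single_linkage_clusters` and the integer sequence identifiers are assumptions made for this example.

```python
def single_linkage_clusters(n_items, aligned_pairs):
    """Group items 0..n_items-1 into connected components.

    aligned_pairs holds (a, b) tuples, one per pairwise alignment.
    Hypothetical helper for illustration; not part of RECON.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in aligned_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for item in range(n_items):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())


# Items 0 and 2 are never aligned directly, yet they end up together
# because both align to item 1.
print(single_linkage_clusters(5, [(0, 1), (1, 2), (3, 4)]))
```

The chaining shown in the usage line is the weakness that makes plain single linkage clustering lump related but distinct families together, which is what the extensions described in the abstract address.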

Figures

**Figure 1**
Flowchart of the de novo strategy. Input genomic sequences (black lines on top) contain a family of repeats with three copies (i.e., elements); two full length (blue and red boxes) and one partially deleted (green box). These elements, unknown at this point, will yield three alignments in an all versus all pairwise comparison of the genomic sequences. The aligned fragments (i.e., images), colored as their corresponding elements for clarity, are sorted to their corresponding genomic region, and those coming from the same element (i.e., syntopic images) can be grouped together according to their overlaps. On the basis of the syntopic sets, elements can be defined. These defined elements are then clustered into one family because they are all similar to each other.
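
The element-definition step in this caption can be sketched as interval merging: images that overlap on the same genomic sequence are grouped into one syntopic set, and the element is taken as the span of that set. This is a simplified sketch under the assumption that any overlap counts; the caption does not give the actual overlap criterion, and `define_elements` is a hypothetical name.

```python
def define_elements(images):
    """Merge overlapping images on one sequence into elements.

    images: list of (start, end) half-open intervals.
    Returns a list of (element_start, element_end, member_images).
    Assumption: any overlap makes two images syntopic.
    """
    elements = []
    for start, end in sorted(images):
        if elements and start < elements[-1][1]:
            # Overlaps the current element: extend it and record the image.
            prev_start, prev_end, members = elements[-1]
            elements[-1] = (prev_start, max(prev_end, end), members + [(start, end)])
        else:
            # No overlap: this image starts a new element.
            elements.append((start, end, [(start, end)]))
    return elements


for element in define_elements([(100, 400), (120, 410), (900, 1200), (150, 300)]):
    print(element)
```

In this example the three images near position 100 collapse into one element spanning 100 to 410, and the image at 900 to 1200 forms a second element.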

**Figure 2**
Different biological scenarios require different methods of syntopy inference. (A) For three images (thin black lines) in a genomic region (top bold black line), the single and double coverage methods lead to different definitions of elements. (B) A full-length element and its images (black and grey lines below). The top long image is formed with another full-length member in its family, whereas the shorter images are formed with the fragmented members. (C) A segmental duplication covering two kinds of elements. The top long image is formed with the other copy of this segmental duplication, whereas the shorter images are formed with other members in the two families, respectively.
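
The caption names single and double coverage without defining them. The sketch below assumes one plausible reading: under single coverage the overlap must cover a minimum fraction of either image, and under double coverage it must cover that fraction of both. These definitions, the `min_fraction` threshold, and the function names are assumptions for illustration and should be checked against the full paper.

```python
def overlap_len(a, b):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def syntopic(a, b, min_fraction=0.5, mode="single"):
    """Decide whether two images belong to the same element.

    mode="single": overlap covers min_fraction of at least one image (assumed).
    mode="double": overlap covers min_fraction of both images (assumed).
    """
    ov = overlap_len(a, b)
    frac_a = ov / (a[1] - a[0])
    frac_b = ov / (b[1] - b[0])
    if mode == "single":
        return max(frac_a, frac_b) >= min_fraction
    if mode == "double":
        return min(frac_a, frac_b) >= min_fraction
    raise ValueError("mode must be 'single' or 'double'")


long_image = (0, 1000)   # e.g. formed with another full-length copy
short_image = (0, 200)   # e.g. formed with a fragmented copy
print(syntopic(long_image, short_image, mode="single"))
print(syntopic(long_image, short_image, mode="double"))
```

Under these assumed definitions the two rules disagree on the long and short image pair, which is the kind of divergence panel A describes.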

**Figure 3**
The RECON algorithm uses the aggregation of endpoints in the multiple alignment of images to distinguish between different biological scenarios.
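
One way to picture "aggregation of endpoints" is to collect the start and end coordinates of all images in a region and look for positions where many of them pile up. The sketch below is an illustrative simplification, not the RECON procedure; `window` and `min_count` are hypothetical parameters chosen for the example.

```python
def endpoint_clusters(images, window=10, min_count=3):
    """Find positions where image endpoints aggregate.

    images: list of (start, end) intervals in one genomic region.
    Returns (first_pos, last_pos, count) for each group of endpoints
    spanning at most `window` bases with at least `min_count` members.
    """
    endpoints = sorted(pos for image in images for pos in image)
    clusters, current = [], []
    for pos in endpoints:
        if current and pos - current[0] > window:
            # The running group is complete; keep it if dense enough.
            if len(current) >= min_count:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_count:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


images = [(0, 1000), (0, 400), (5, 398), (2, 402), (405, 1000)]
print(endpoint_clusters(images))
```

Here endpoints pile up near position 0 and near position 400. A dense cluster in the interior of a region, as at 400, is the kind of signal that could indicate a boundary between two elements rather than random fragmentation.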

**Figure 4**
Complications because of sequence similarity between related families. (A) The schematic structure of Tc1 and Tc7, two related DNA transposons that are similar at the end of their terminal inverted repeats (black and grey blocks) but not in the rest of the sequences (Plasterk and van Luenen 1997). (B) A Tc1 element and its images. (C) Images in B are filtered, and only those ends labeled with closed circles will be collected to determine whether the element should be split. Open circles in Box b mark the misleading ends. Dashed lines link the pairs of images formed with the same copy of Tc7 and represent the unalignable sequences between a Tc1 and a Tc7. Although not shown in the figure, the two TIRs of Tc1 also form alignments in the opposite strands, and images from these alignments are also filtered.

**Figure 5**
False primary edges because of partial elements. (A) The schematic structure of full-length Tc1 and Tc7 (see also Fig. 4) and a partially deleted Tc7, which preserves only the region similar to Tc1. (B) Graph constructed for Tc1s and Tc7s. Closed nodes represent full-length elements. Solid and dashed lines represent primary and secondary edges, respectively. (C) Certain primary edges are removed from the partial Tc7 to eliminate the false ones.

Comment in

Transcendent elements: whole-genome transposon screens and open evolutionary questions.
Holmes I. Genome Res. 2002 Aug;12(8):1152-5. doi: 10.1101/gr.453102. PMID: 12176921. Review.

Cited by

Multi-omics analyses provide insights into the evolutionary history and the synthesis of medicinal components of the Chinese wingnut.
Zhang ZY, Xia HX, Yuan MJ, Gao F, Bao WH, Jin L, Li M, Li Y. Plant Divers. 2024 Apr 8;46(3):309-320. doi: 10.1016/j.pld.2024.03.010. eCollection 2024 May. PMID: 38798724.

Acceleration of genome rearrangement in clitellate annelids.
Schultz DT, Heath-Heckman EAC, Winchell CJ, Kuo DH, Yu YS, Oberauer F, Kocot KM, Cho SJ, Simakov O, Weisblat DA. bioRxiv [Preprint]. 2024 May 14:2024.05.12.593736. doi: 10.1101/2024.05.12.593736. PMID: 38798472.

De Novo Genome Assembly and Annotation of Leptosia nina Provide New Insights into the Evolutionary Dynamics of Genes Involved in Host-Plant Adaptation of Pierinae Butterflies.
Okamura Y, Vogel H. Genome Biol Evol. 2024 May 2;16(5):evae105. doi: 10.1093/gbe/evae105. PMID: 38778773.

De Novo Genome Assembly for the Coppery Titi Monkey (Plecturocebus cupreus): An Emerging Nonhuman Primate Model for Behavioral Research.
Pfeifer SP, Baxter A, Savidge LE, Sedlazeck FJ, Bales KL. Genome Biol Evol. 2024 May 2;16(5):evae108. doi: 10.1093/gbe/evae108. PMID: 38758096.

Exploring the extrachromosomal plasmid rDNA of Naegleria fowleri AY27 genotype II: A human brain-eating amoeba via high-throughput sequencing.
Aurongzeb M, Talha Malik HM, Jahanzaib M, Hassan SS, Rashid Y, Aziz T, Alharbi M. BMC Med Genomics. 2024 May 7;17(1):125. doi: 10.1186/s12920-024-01890-y. PMID: 38715056.
