The Dfam database of repetitive DNA families

Nucleic Acids Res. 2016 Jan 4;44(D1):D81-9. doi: 10.1093/nar/gkv1272. Epub 2015 Nov 26.

Robert Hubley¹, Robert D Finn², Jody Clements³, Sean R Eddy⁴, Thomas A Jones⁴, Weidong Bao⁵, Arian F A Smit⁶, Travis J Wheeler⁷

Affiliations

¹ Institute for Systems Biology, Seattle, WA 98109, USA. rhubley@systemsbiology.org
² European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1RQ, UK.
³ HHMI Janelia Research Campus, Ashburn, VA 20147, USA.
⁴ Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA.
⁵ Genetic Information Research Institute, Los Altos, CA 94022, USA.
⁶ Institute for Systems Biology, Seattle, WA 98109, USA.
⁷ University of Montana, Missoula, MT 59812, USA. travis.wheeler@umontana.edu

PMID: 26612867
PMCID: PMC4702899
DOI: 10.1093/nar/gkv1272

Robert Hubley et al. Nucleic Acids Res. 2016.

Abstract

Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.

Figures

**Figure 1.**
Influence of average relative entropy on annotation for one family. This plot shows the impact of target average relative entropy values of the Charlie15a (DF0000089) model on both annotation coverage (true positives) and overextension. Using the Charlie15a seed, profile HMMs were built with HMMER's *hmmbuild* tool, with varying target average relative entropy values ranging from 0.4 to 0.9 bits per position, using the --ere flag. The largest of these values represents the average relative entropy of the model when no sequence downweighting (entropy weighting) is performed. Coverage was assessed by searching each entropy-weighted profile HMM against the human genome. Overextension was assessed by searching each profile against a simulated genome containing fragments of true Charlie15a elements planted into realistic simulated genomic sequence built using GARLIC.
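To make the quantity on the x-axis concrete, the following is a minimal sketch of how an average relative entropy in bits per position could be computed for a toy profile. It is not HMMER's implementation: the uniform background, the three-column `profile`, and the function names are all assumptions chosen for illustration.

```python
import math

def position_relative_entropy(probs, background):
    """Relative entropy (bits) of one emission distribution vs. the background."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return sum(p * math.log2(p / background[a]) for a, p in probs.items() if p > 0)

def average_relative_entropy(profile, background):
    """Mean per-position relative entropy across all profile columns."""
    total = sum(position_relative_entropy(col, background) for col in profile)
    return total / len(profile)

# Assumed uniform nucleotide background (hypothetical, for illustration only).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical three-column profile: fully conserved, half conserved, uninformative.
profile = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

avg = average_relative_entropy(profile, background)
print(f"average relative entropy: {avg:.2f} bits per position")
```

Downweighting the seed sequences pulls each column toward the background distribution, which lowers this average; that is the sense in which the unweighted model has the largest value.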

**Figure 2.**
Impact of exponential entropy weighting on position-specific relative entropy. L1PREC2_5end (DF0000315) per-position relative entropy averaged over 30 bp windows with uniform and exponential entropy weighting functions. The region around position 1900 caused both false hits and overextension of true hits when using uniform entropy weighting; most of these were removed with the higher positional relative entropy generated using exponential entropy weighting.
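The windowed averaging used for this kind of plot can be sketched as follows. The caption does not say whether the 30 bp windows overlap, so this example assumes consecutive non-overlapping windows; the function name and the input values are hypothetical.

```python
def window_means(values, window=30):
    """Average per-position values over consecutive, non-overlapping windows.

    The final window may be shorter than `window` if the model length is not
    an exact multiple of it; it is averaged over the positions it contains.
    """
    means = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means

# Hypothetical per-position relative entropies (bits) for a 70-position model.
per_position = [1.0] * 30 + [0.5] * 30 + [2.0] * 10

smoothed = window_means(per_position, window=30)
print(smoothed)
```

A region of low windowed relative entropy, like the one the caption describes near position 1900, would show up here as a dip in `smoothed`.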

**Figure 3.**
Distribution of overextension lengths. Profile HMMs for human Dfam families were searched against an overextension benchmark trained on human sequence data, built using GARLIC. For each hit above GA threshold, overextension was calculated. The plot shows, for each overextension length, the number of hits with that length. Application of our two changes (increased average relative entropy and exponential entropy weighting) clearly reduced the frequency of very long overextensions.
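One way to tabulate such a distribution is sketched below. The caption does not give the exact definition of overextension, so this example assumes it is the number of hit bases falling outside the planted true fragment, summed over both flanks; the coordinates and names are hypothetical.

```python
from collections import Counter

def overextension(hit_start, hit_end, true_start, true_end):
    """Bases by which a hit extends past a planted fragment (assumed definition)."""
    left = max(0, true_start - hit_start)   # extension past the left boundary
    right = max(0, hit_end - true_end)      # extension past the right boundary
    return left + right

# Hypothetical hits as (hit_start, hit_end, true_start, true_end).
hits = [
    (95, 210, 100, 200),   # extends 5 bp left and 10 bp right
    (100, 200, 100, 200),  # exact match to the planted fragment
    (120, 205, 100, 200),  # starts inside the fragment, extends 5 bp right
]

lengths = [overextension(*h) for h in hits]
histogram = Counter(lengths)  # overextension length -> number of hits

for length in sorted(histogram):
    print(length, histogram[length])
```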

**Figure 4.**
Hit statistics for MLT1A (DF0001126).

**Figure 5.**
Hits displayed on karyotypes. This plot shows the distribution of HAT1_CE (DF0001401) elements across C. elegans chromosomes, demonstrating the well-known accumulation of some DNA transposons towards telomeres (26,27).

**Figure 6.**
Coverage, Conservation, and Insert plot for MIR (DF0000001).

Cited by

Genome of the endangered eastern quoll (Dasyurus viverrinus) reveals signatures of historical decline and pelage color evolution.
Hartley GA, Frankenberg SR, Robinson NM, MacDonald AJ, Hamede RK, Burridge CP, Jones ME, Faulkner T, Shute H, Rose K, Brewster R, O'Neill RJ, Renfree MB, Pask AJ, Feigin CY. Commun Biol. 2024 May 25;7(1):636. doi: 10.1038/s42003-024-06251-0. PMID: 38796620.
Diversity and evolution of transposable elements in the plant-parasitic nematodes.
Dayi M. BMC Genomics. 2024 May 23;25(1):511. doi: 10.1186/s12864-024-10435-7. PMID: 38783171.
RNA editing in host lncRNAs as potential modulator in SARS-CoV-2 variants-host immune response dynamics.
Chattopadhyay P, Mehta P, Kanika, Mishra P, Chen Liu CS, Tarai B, Budhiraja S, Pandey R. iScience. 2024 Apr 29;27(6):109846. doi: 10.1016/j.isci.2024.109846. eCollection 2024 Jun 21. PMID: 38770134.
Reference-free inferring of transcriptomic events in cancer cells on single-cell data.
Eralp B, Sefer E. BMC Cancer. 2024 May 20;24(1):607. doi: 10.1186/s12885-024-12331-5. PMID: 38769480.
SOS1 tonoplast neo-localization and the RGG protein SALTY are important in the extreme salinity tolerance of Salicornia bigelovii.
Salazar OR, Chen K, Melino VJ, Reddy MP, Hřibová E, Čížková J, Beránková D, Arciniegas Vega JP, Cáceres Leal LM, Aranda M, Jaremko L, Jaremko M, Fedoroff NV, Tester M, Schmöckel SM. Nat Commun. 2024 May 20;15(1):4279. doi: 10.1038/s41467-024-48595-5. PMID: 38769297.

References

1. Bao Z., Eddy S.R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002;12:1269–1276.
2. Price A.L., Jones N.C., Pevzner P.A. De novo identification of repeat families in large genomes. Bioinformatics. 2005;21(Suppl. 1):I351–I358.
3. Flutre T., Duprat E., Feuillet C., Quesneville H. Considering transposable element diversification in de novo annotation approaches. PLoS ONE. 2011;6:e16526.
4. Kohany O., Gentles A., Hankus L., Jurka J. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics. 2006;7:474.
5. Krogh A. An Introduction to Hidden Markov Models for Biological Sequences. In: Searls D, Kasif S, editors. Computational Methods in Molecular Biology. Elsevier; 1998. pp. 45–63.
