Rfam: an RNA family database

Griffiths-Jones, Sam; Bateman, Alex; Marshall, Mhairi; Khanna, Ajay; Eddy, Sean R.

doi:10.1093/nar/gkg006

Abstract

Rfam is a collection of multiple sequence alignments and covariance models representing non-coding RNA families. Rfam is available on the web in the UK at http://www.sanger.ac.uk/Software/Rfam/ and in the US at http://rfam.wustl.edu/. These websites allow the user to search a query sequence against a library of covariance models, and view multiple sequence alignments and family annotation. The database can also be downloaded in flatfile form and searched locally using the INFERNAL package (http://infernal.wustl.edu/). The first release of Rfam (1.0) contains 25 families, which annotate over 50 000 non-coding RNA genes in the taxonomic divisions of the EMBL nucleotide database.

Received August 15, 2002; Accepted September 1, 2002

INTRODUCTION

Non-coding RNA (ncRNA) genes produce a functional RNA molecule as a final product, rather than a translated protein. Current gene-finding methods largely ignore ncRNA genes, yet these genes produce some of the cell's most important products: transfer RNA and ribosomal RNA are two well-known examples. The number of known RNA genes is expanding rapidly, driven by the deluge of genomic data and aided by recent systematic efforts to detect RNA genes [reviewed in (1–3)].

Just like protein-coding genes, ncRNAs fall into families that have evolved from a common ancestor. By making alignments of these families of ncRNA genes we can learn about their structure and function. Indeed, accurate prediction of RNA secondary structure relies on multiple sequence alignments to provide data on co-varying bases (4). Ribosomal RNA alignments are used to make molecular phylogenies that guide taxonomic classification of all species (5).

Many RNA sequence families conserve a consensus base-paired secondary structure. Standard primary sequence analysis tools [such as BLAST (6) for database searches and CLUSTALW (7) for multiple alignment] are useful for closely related RNAs, but recognition and alignment of more distantly related structural RNAs is greatly aided by consensus secondary structure information. Historically, structure-based RNA sequence analysis has been difficult to automate. Most RNA structural alignments are the product of expert manual curation. Recent software advances (8) using secondary structure profiles called ‘covariance models’ (CMs, also called profile stochastic context-free grammars) (1,9) have led us to begin the development and automated maintenance of a database of structural RNA alignments. This is analogous to the use of profile hidden Markov models of primary sequence consensus in the development and maintenance of thousands of protein sequence alignments in the Pfam database (10).

Several databases already exist that contain RNA alignments and information, for example, the European Large Subunit Ribosomal RNA Database (11), the SRP database (12), the uRNA database (13), the Comparative RNA Web (14), and others (15–22). These databases are well curated and provide a large amount of information to the specialist. However, they vary greatly in the file formats used and the data presented. There are also several specialised computational tools to aid identification of specific RNA types. For example, tRNAscan-SE is a standard tool in the genome annotation field for identifying tRNA genes with extremely high sensitivity and specificity (23). A recent report describes a new tool, BRUCE, which aims to predict tmRNA genes in genomic sequence (24). However, the RNA analysis field lacks any analogue to the comprehensive secondary sequence databases that greatly aid protein annotation, such as Pfam (10), SMART (25) and Prosite (26).

The aims of the Rfam database are (i) to integrate the many existing curated structural RNA alignments (in addition to new alignments) into a common structure-annotated format, analogous to Pfam's curated seed alignments; (ii) to use covariance model software to search the growing sequence databases and maintain automatically-generated alignments of all detectable homologues, analogous to Pfam's automatically-generated full alignments; and (iii) to provide a system for automatically analysing and annotating sequences (including complete genome sequences) for the presence of homologues to known structural RNAs, analogous to the public Pfam search servers.

METHODS

Each family in Rfam is represented by two multiple sequence alignments and a covariance model. The seed alignment contains known representative members of the family, is hand-curated, and is annotated with structural information. The seed alignment is used to build a covariance model using the CMBUILD program from the INFERNAL suite (http://infernal.wustl.edu/) (8). The model is then used to search a nucleotide sequence database using the CMSEARCH program. CMSEARCH reports scores for matches to the model, and a family-specific threshold is chosen such that we believe no false positives fall above the threshold. The matches are then aligned to the model using the CMALIGN program.
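The thresholding step described above can be illustrated with a minimal sketch. The `Hit` record, its field names, the example scores and the threshold value are all hypothetical and chosen only for illustration; they are not the CMSEARCH output format. Treating a score equal to the threshold as a member is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One reported match to a covariance model (hypothetical record layout)."""
    seq_id: str
    start: int
    end: int
    score: float  # score as reported by the search program


def select_family_members(hits, threshold):
    """Keep hits scoring at or above the family-specific threshold, best first.

    Assumption: a hit whose score equals the threshold is kept.
    """
    kept = [h for h in hits if h.score >= threshold]
    return sorted(kept, key=lambda h: h.score, reverse=True)


# Made-up hits for a single family; the threshold is picked by a curator
# so that no known false positive scores above it.
hits = [
    Hit("SEQ_A", 120, 192, 61.3),
    Hit("SEQ_B", 5, 77, 18.9),
    Hit("SEQ_C", 940, 1011, 44.0),
]
members = select_family_members(hits, threshold=30.0)
```

In this sketch only the hits that pass the threshold would go on to be aligned to the model to form the full alignment.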

The nucleotide database searched is called RFAMSEQ, and is built from a subset of the EMBL nucleotide database (27). RFAMSEQ 1 is based on EMBL release 71. RFAMSEQ includes the ‘finished’ portion of EMBL distributed in the organism-specific data files, and excludes the EST, GSS, HTG, HTC, STS and patent sections of the database. Despite these exclusions, RFAMSEQ 1 contains 1 075 317 sequences and over 5.3 billion bases. CM searches are particularly computationally expensive, with a small model (such as tRNA) searching around 200 bases per second on a 600 MHz Compaq ALPHA. A full CM search of RFAMSEQ with one small model would take around 300 CPU days. The search time scales roughly with the cube of the query consensus length, so this quickly becomes entirely infeasible for larger RNAs. We therefore employ an initial BLAST search (6) with relaxed search parameters to reduce the search space. All BLAST hits with P-value <10 to a member of the seed alignment are retrieved, a family-specific window is added to each end of the matches, and the reduced database is subjected to a full CM search. This approach is similar to that employed by the tRNAscan-SE program, which uses a heuristic first step followed by a full covariance model search (23), but is generally applicable to any ncRNA search. We anticipate that technological and software improvements will in the future allow us to conduct full CM searches to build family alignments.
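Two points in the paragraph above lend themselves to a short sketch: the search-time arithmetic, and the way BLAST hits are widened into windows for the full CM search. The function names, the 1-based inclusive coordinates, the example hit positions and the window size are hypothetical. Merging overlapping windows into one region is an assumption made here to avoid searching the same bases twice; the text does not say whether this is done.

```python
SECONDS_PER_DAY = 86_400


def cm_search_cpu_days(total_bases, bases_per_second):
    """CPU days needed to scan `total_bases` at a fixed search speed."""
    return total_bases / bases_per_second / SECONDS_PER_DAY


def relative_cost(length, reference_length):
    """Search cost relative to a reference model, assuming pure cubic scaling
    with consensus length (the text says 'roughly' cubic)."""
    return (length / reference_length) ** 3


def windows_from_hits(hits, window, seq_length):
    """Extend each (start, end) hit by `window` bases on both sides, clip to
    the sequence, and merge overlapping or adjacent regions (assumption)."""
    extended = sorted(
        (max(1, start - window), min(seq_length, end + window))
        for start, end in hits
    )
    merged = []
    for start, end in extended:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous region: grow it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Figures from the text: 5.3 billion bases at about 200 bases per second.
full_search_days = cm_search_cpu_days(5.3e9, 200)

# A model four times longer than the reference costs 4**3 times as much
# under the cubic assumption (lengths here are made up).
cost_ratio = relative_cost(300, 75)

# Three made-up BLAST hits on one 960-base sequence, window of 50 bases.
regions = windows_from_hits([(100, 150), (180, 220), (900, 950)], 50, 960)
```

With the figures quoted in the text, the first calculation gives a little over 300 CPU days, in line with the estimate above.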

AVAILABILITY

Rfam is available on the web at http://www.sanger.ac.uk/Software/Rfam/ in the UK, and http://rfam.wustl.edu/ in the US. The database is also available in flatfile format for local installation. To search Rfam locally, the user will also need the INFERNAL software suite, available from http://infernal.wustl.edu/. Table 1 shows a list of families contained in Rfam 1.0. These families annotate over 50 000 ncRNAs in the RFAMSEQ database.

WEBSITE FEATURES

The Rfam websites have been designed to be intuitive to use: users of the Pfam database of protein families will recognise the layout and format of the database. The websites provide the facility to search a DNA sequence against the library of CMs. The user can view annotation on each RNA family, and follow links to other databases and literature references. The multiple sequence alignments on which Rfam is based are available in a number of formats for viewing in a browser or for downloading. Both the seed and full alignments contain secondary structure mark-up to describe the base-paired positions in the member sequences, and the web view provides a colour-encoded representation of these co-varying columns (Fig. 1). In addition, the web pages allow the user to quickly determine the species distribution within a family.

FUTURE DIRECTIONS

Rfam is under active development and will increase significantly in size and scope over the next 12 months. Novel ncRNA genes are being discovered at a rapid rate, and we aim to quickly translate such discoveries into useful and searchable RNA families. However, we recognise a number of limitations with our approach. The most obvious of these is the computational cost of using CMs. We predict that technological advances will soon make these searches far more feasible, and will allow full CM genome-wide searches for ncRNAs using Rfam. Until such a time, narrowing the search space using BLAST greatly facilitates such searches, though at an inevitable and unknown cost in search sensitivity. In addition, there are RNA families that we cannot model using the alignment- and profile-based approach at present, for example, microRNAs (miRNA precursor secondary structures are only vaguely similar stem-loops) and many small nucleolar RNAs (the consensus of modification guide snoRNAs includes significant inter-molecular base pairing to their target RNAs). Despite such limitations, the Rfam library of alignments and CMs provides a useful tool for genome annotation, as well as a comprehensive resource for RNA family information and multiple sequence alignments.

ACKNOWLEDGEMENTS

We are grateful to William Mifsud for providing annotation for many of the families in Rfam.

Figure 1. Seed alignment for the U12 spliceosomal RNA family from the UK website. Secondary structure base pairs are encoded in the coloured bases in the alignment, and are marked up in the SS_cons lines with nested sets of < and > tags. The sequence accessions link to entries in the EMBL database.
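The nested < and > mark-up mentioned in the caption can be turned into explicit base pairs with a simple stack, as in the sketch below. The function name and the example string are hypothetical, and treating every character other than < and > as an unpaired column is an assumption about the SS_cons line rather than a description of the Rfam file format.

```python
def base_pairs(ss_cons):
    """Return the (i, j) alignment columns (0-based) paired by nested < > tags.

    Assumption: any symbol other than '<' or '>' marks an unpaired column.
    """
    open_columns = []  # stack of columns holding an unmatched '<'
    pairs = []
    for column, symbol in enumerate(ss_cons):
        if symbol == "<":
            open_columns.append(column)
        elif symbol == ">":
            if not open_columns:
                raise ValueError(f"unmatched '>' at column {column}")
            # The most recent unmatched '<' closes first (nested pairing).
            pairs.append((open_columns.pop(), column))
    if open_columns:
        raise ValueError(f"unmatched '<' at column {open_columns[-1]}")
    return sorted(pairs)


# A made-up structure line: an outer two-pair stem enclosing an inner hairpin.
example_pairs = base_pairs("<<..<<...>>..>>")
```

Because the tags are nested, each > pairs with the nearest unmatched < to its left, which is why a stack is sufficient in this sketch.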


Table 1. Accession numbers and descriptions of the families in Rfam release 1.0. Where data from third party sources are repackaged, or have been used in the construction of seed alignments, the appropriate source is cited.

Rfam accession  Family description               Third party sources
RF00001         5S ribosomal RNA                 5S ribosomal RNA database (16)
RF00002         5.8S ribosomal RNA               European LSU rRNA database (11)
RF00003         U1 spliceosomal RNA              The uRNA database (13)
RF00004         U2 spliceosomal RNA              The uRNA database (13)
RF00005         Transfer RNA
RF00006         Vault RNA
RF00007         U12 minor spliceosomal RNA       (28)
RF00008         Hammerhead ribozyme
RF00009         Nuclear RNase P                  The Ribonuclease P Database (15)
RF00010         Bacterial RNase P class A        The Ribonuclease P Database (15)
RF00011         Bacterial RNase P class B        The Ribonuclease P Database (15)
RF00012         U3 small nucleolar RNA           The uRNA database (13)
RF00013         6S/SsrS RNA
RF00014         DsrA RNA
RF00015         U4 spliceosomal RNA              The uRNA database (13)
RF00016         U14 small nucleolar RNA
RF00017         Signal recognition particle RNA  SRPDB (12)
RF00018         CsrB/RsmB RNA
RF00019         Y RNA (Ro RNP component)
RF00020         U5 spliceosomal RNA              The uRNA database (13)
RF00021         Spot 42 RNA
RF00022         GcvB RNA
RF00023         tmRNA                            The tmRNA Website (21)
RF00024         Vertebrate telomerase RNA        (29)
RF00025         Ciliate telomerase RNA           (30)

References

1. Eddy,S.R. (2002) Computational genomics of noncoding RNA genes. Cell, 109, 137–140.
2. Eddy,S.R. (2001) Non-coding RNA genes and the modern RNA world. Nature Rev. Genet., 2, 919–929.
3. Storz,G. (2002) An expanding universe of noncoding RNAs. Science, 296, 1260–1263.
4. Pace,N.R., Thomas,B.C. and Woese,C.R. (1999) Probing RNA structure, function and history by comparative analysis. In Gesteland,R.F., Cech,T.R. and Atkins,J.F. (eds), The RNA World, 2nd Edn. Cold Spring Harbor Laboratory Press, pp. 113–141.
5. Pace,N.R. (1997) A molecular view of microbial diversity and the biosphere. Science, 276, 734–740.
6. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
7. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
8. Eddy,S.R. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.
9. Eddy,S.R. and Durbin,R. (1994) RNA sequence analysis using covariance models. Nucleic Acids Res., 22, 2079–2088.
10. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
11. Wuyts,J., De Rijk,P., Van de Peer,Y., Winkelmans,T. and De Wachter,R. (2001) The European Large Subunit Ribosomal RNA Database. Nucleic Acids Res., 29, 175–177.
12. Gorodkin,J., Knudsen,B., Zwieb,C. and Samuelsson,T. (2001) SRPDB (Signal Recognition Particle Database). Nucleic Acids Res., 29, 169–170.
13. Zwieb,C. (1997) The uRNA database. Nucleic Acids Res., 25, 102–103.
14. Cannone,J.J., Subramanian,S., Schnare,M.N., Collett,J.R., D'Souza,L.M., Du,Y., Feng,B., Lin,N., Madabusi,L.V., Müller,K.M., Pande,N., Shang,Z., Yu,N. and Gutell,R.R. (2002) The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2.
15. Brown,J.W. (1999) The Ribonuclease P Database. Nucleic Acids Res., 27, 314.
16. Szymanski,M., Barciszewska,M.Z., Erdmann,V.A. and Barciszewski,J. (2002) 5S Ribosomal RNA Database. Nucleic Acids Res., 30, 176–178.
17. Klosterman,P.S., Tamura,M., Holbrook,S.R. and Brenner,S.E. (2002) SCOR: a Structural Classification of RNA database. Nucleic Acids Res., 30, 392–394.
18. Maidak,B.L., Cole,J.R., Lilburn,T.G., Parker,C.T.,Jr, Saxman,P.R., Farris,R.J., Garrity,G.M., Olsen,G.J., Schmidt,T.M. and Tiedje,J.M. (2001) The RDP-II (Ribosomal Database Project). Nucleic Acids Res., 29, 173–174.
19. van Batenburg,F.H., Gultyaev,A.P. and Pleij,C.W. (2001) PseudoBase: structural information on RNA pseudoknots. Nucleic Acids Res., 29, 194–195.
20. Wuyts,J., Van de Peer,Y., Winkelmans,T. and De Wachter,R. (2002) The European database on small subunit ribosomal RNA. Nucleic Acids Res., 30, 183–185.
21. Williams,K.P. (2002) The tmRNA Website: invasion by an intron. Nucleic Acids Res., 30, 179–182.
22. Knudsen,B., Wower,J., Zwieb,C. and Gorodkin,J. (2001) tmRDB (tmRNA database). Nucleic Acids Res., 29, 171–172.
23. Lowe,T.M. and Eddy,S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955–964.
24. Laslett,D., Canback,B. and Andersson,S. (2002) BRUCE: a program for the detection of transfer–messenger RNA genes in nucleotide sequences. Nucleic Acids Res., 30, 3449–3453.
25. Letunic,I., Goodstadt,L., Dickens,N.J., Doerks,T., Schultz,J., Mott,R., Ciccarelli,F., Copley,R.R., Ponting,C.P. and Bork,P. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res., 30, 242–244.
26. Falquet,L., Pagni,M., Bucher,P., Hulo,N., Sigrist,C.J., Hofmann,K. and Bairoch,A. (2002) The PROSITE database, its status in 2002. Nucleic Acids Res., 30, 235–238.
27. Stoesser,G., Baker,W., van den Broek,A., Camon,E., Garcia-Pastor,M., Kanz,C., Kulikova,T., Leinonen,R., Lin,Q., Lombard,V., Lopez,R., Redaschi,N., Stoehr,P., Tuli,M.A., Tzouvara,K. and Vaughan,R. (2002) The EMBL Nucleotide Sequence Database. Nucleic Acids Res., 30, 21–26.
28. Shukla,G.C. and Padgett,R.A. (1999) Conservation of functional features of U6atac and U12 snRNAs between vertebrates and higher plants. RNA, 5, 525–538.
29. Chen,J.L., Blasco,M.A. and Greider,C.W. (2000) Secondary structure of vertebrate telomerase RNA. Cell, 100, 503–514.
30. McCormick-Graham,M. and Romero,D.P. (1995) Ciliate telomerase RNA structural features. Nucleic Acids Res., 23, 1091–1097.

Author notes

The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
1 Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, St Louis, MO 63110, USA
