RefSeq: an update on prokaryotic genome annotation and curation

Daniel H Haft¹, Michael DiCuccio¹, Azat Badretdin¹, Vyacheslav Brover¹, Vyacheslav Chetvernin¹, Kathleen O'Neill¹, Wenjun Li¹, Farideh Chitsaz¹, Myra K Derbyshire¹, Noreen R Gonzales¹, Marc Gwadz¹, Fu Lu¹, Gabriele H Marchler¹, James S Song¹, Narmada Thanki¹, Roxanne A Yamashita¹, Chanjuan Zheng¹, Françoise Thibaud-Nissen¹, Lewis Y Geer¹, Aron Marchler-Bauer¹, Kim D Pruitt¹

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892-6511, USA.

Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860.

PMID: 29112715
PMCID: PMC5753331
DOI: 10.1093/nar/gkx1068

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule, BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Published by Oxford University Press on behalf of Nucleic Acids Research 2017.


Figures

**Figure 1.**
New workflow for structural annotation by the PGAP 4.x series pipeline. Computational processes are shown in blue, data in white or gray. GeneMarkS+ provides *ab initio* prediction of protein-coding genes, but in the context of hints from homology-based evidence, including HMM evidence for the first time. The use of ORFfinder to produce all stop-to-stop translations, and HMM searching to find every translation with an HMM hit, are steps first introduced in the PGAP-4.1 release. The pipeline detects both disrupted genes (e.g. pseudogenes) and exceptional reading frames (e.g. selenoproteins).
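
To make the phrase "stop-to-stop translations" concrete, the following is a minimal Python sketch of the general idea: translate all six reading frames and split each translation at stop codons. It is not ORFfinder and does not reproduce PGAP behavior. The function names and the `min_len` parameter are hypothetical, and the codon assignments are the standard ones (NCBI translation table 11 uses the same assignments and differs from the standard table only in its permitted start codons).

```python
from itertools import product

# Standard codon assignments in NCBI order (bases T, C, A, G); '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate_frame(seq, offset):
    """Translate seq starting at offset; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def stop_to_stop_translations(dna, min_len=1):
    """Yield-style list of (strand, frame offset, peptide) for all six frames.

    Each frame's translation is split at stop codons. Segments touching a
    sequence end are bounded by a stop on one side only (an assumption of
    this sketch; a real tool would flag such partial segments).
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for segment in translate_frame(seq, offset).split("*"):
                if len(segment) >= min_len:
                    results.append((strand, offset, segment))
    return results
```

In a pipeline of the kind the caption describes, each resulting peptide would then be searched against the HMM library; that search step is outside the scope of this sketch.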

**Figure 2.**
A partially expanded view of the homology evidence and protein naming hierarchy used in RefSeq and PGAP annotation. Four families of beta-lactamases are shown (A, metallo, C, and D), each of which is more similar to various hydrolases of other substrates, such as RNA, than to any members of the other beta-lactamase classes. For each class, a protein profile HMM identifies members and suggests a protein product name, but further expansion of the hierarchy can reveal multiple child families, each identified by a more specific HMM that receives a higher precedence during annotation. The hierarchy of evidence largely follows an implicit hierarchy of protein names, with exceptions necessary occasionally, as when unrelated proteins perform closely related functions.
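
The precedence idea in this caption can be illustrated with a small Python sketch. It assumes, as a simplification, that specificity equals depth in the hierarchy, so that among several HMM hits the deepest node supplies the product name. The identifiers (`HMM_classA`, `HMM_KPC`), the `EvidenceNode` type, and the fallback name are hypothetical and are not taken from PGAP's actual data or code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceNode:
    """One evidence source (e.g. an HMM) in a toy naming hierarchy."""
    accession: str
    product_name: str
    parent: Optional[str] = None  # accession of the more general parent


# Hypothetical two-level hierarchy: a class-level HMM and a more specific child.
TOY_HIERARCHY = {
    "HMM_classA": EvidenceNode("HMM_classA", "class A beta-lactamase"),
    "HMM_KPC": EvidenceNode(
        "HMM_KPC",
        "carbapenem-hydrolyzing class A beta-lactamase KPC",
        parent="HMM_classA",
    ),
}


def depth(accession, hierarchy):
    """Count the steps from a node up to its root."""
    steps = 0
    node = hierarchy[accession]
    while node.parent is not None:
        node = hierarchy[node.parent]
        steps += 1
    return steps


def choose_product_name(hit_accessions, hierarchy):
    """Pick the name from the most specific (deepest) hit.

    Assumption of this sketch: deeper means higher precedence. With no
    hits, fall back to a generic placeholder name.
    """
    if not hit_accessions:
        return "hypothetical protein"
    best = max(hit_accessions, key=lambda acc: depth(acc, hierarchy))
    return hierarchy[best].product_name
```

A protein hitting both `HMM_classA` and `HMM_KPC` would receive the KPC name, while one hitting only `HMM_classA` would keep the class-level name. The exceptions the caption mentions, such as unrelated proteins with closely related functions, would need explicit precedence values rather than depth alone.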


Cited by

Regulatory sequence-based discovery of anti-defense genes in archaeal viruses.
Bhoobalan-Chitty Y, Xu S, Martinez-Alvarez L, Karamycheva S, Makarova KS, Koonin EV, Peng X. Nat Commun. 2024 May 2;15(1):3699. doi: 10.1038/s41467-024-48074-x. PMID: 38698035. Free PMC article.
Concatenated ScaA and TSA56 Surface Antigen Sequences Reflect Genome-Scale Phylogeny of Orientia tsutsugamushi: An Analysis Including Two Genomes from Taiwan.
Minahan NT, Yen TY, Guo YL, Shu PY, Tsai KH. Pathogens. 2024 Apr 3;13(4):299. doi: 10.3390/pathogens13040299. PMID: 38668254. Free PMC article.
Acidithiobacillia class members originating at sites within the Pacific Ring of Fire and other tectonically active locations and description of the novel genus 'Igneacidithiobacillus'.
Arisan D, Moya-Beltrán A, Rojas-Villalobos C, Issotta F, Castro M, Ulloa R, Chiacchiarini PA, Díez B, Martín AJM, Ñancucheo I, Giaveno A, Johnson DB, Quatrini R. Front Microbiol. 2024 Apr 3;15:1360268. doi: 10.3389/fmicb.2024.1360268. eCollection 2024. PMID: 38633703. Free PMC article.
KEGG orthology prediction of bacterial proteins using natural language processing.
Chen J, Wu H, Wang N. BMC Bioinformatics. 2024 Apr 11;25(1):146. doi: 10.1186/s12859-024-05766-x. PMID: 38600441. Free PMC article.
Interaction of bacteriophage P1 with an epiphytic Pantoea agglomerans strain - the role of the interplay between various mobilome elements.
Giermasińska-Buczek K, Gawor J, Stefańczyk E, Gągała U, Żuchniewicz K, Rekosz-Burlaga H, Gromadka R, Łobocka M. Front Microbiol. 2024 Mar 25;15:1356206. doi: 10.3389/fmicb.2024.1356206. eCollection 2024. PMID: 38591037. Free PMC article.


