Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860.
doi: 10.1093/nar/gkx1068.

RefSeq: an update on prokaryotic genome annotation and curation

Daniel H Haft et al. Nucleic Acids Res.

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule: BlastRules. Continual curation of supporting evidence and propagation of improved names onto RefSeq proteins ensure that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Figures

Figure 1. New workflow for structural annotation by the PGAP 4.x series pipeline. Computational processes are shown in blue, data in white or gray. GeneMarkS+ provides ab initio prediction of protein-coding genes, but in the context of hints from homology-based evidence, including HMM evidence for the first time. The use of ORFfinder to produce every stop-to-stop translation, and HMM searching to find every translation with an HMM hit, are steps first introduced in the PGAP-4.1 release. The pipeline detects both disrupted genes (e.g. pseudogenes) and exceptional reading frames (e.g. selenoproteins).
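
The stop-to-stop translation step mentioned in the caption can be illustrated with a small sketch. This is not ORFfinder itself and makes no claim about how PGAP calls it; it only shows the general idea of listing every stop-free run of codons in all six reading frames. The function names, the `min_codons` parameter, and the use of the three standard stop codons are assumptions made for illustration.

```python
# Illustrative sketch only; not the ORFfinder or PGAP implementation.
# Assumption: TAA, TAG and TGA are the only stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_to_stop_orfs(seq, min_codons=1):
    """Yield (strand, frame, start, end) for each stop-free codon run.

    Coordinates are 0-based, half-open, and relative to the strand being
    scanned. The terminating stop codon is not included in the run.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    if (i - run_start) // 3 >= min_codons:
                        yield (strand, frame, run_start, i)
                    run_start = i + 3
            # Emit the open run that reaches the end of the sequence, if any.
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - run_start) // 3 >= min_codons:
                yield (strand, frame, run_start, end)


if __name__ == "__main__":
    for orf in stop_to_stop_orfs("ATGAAATAGCCCTGA"):
        print(orf)
```

In a pipeline of the kind the caption describes, each run would then be translated and searched against the HMM library; that step is outside this sketch.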
Figure 2. A partially expanded view of the homology evidence and protein naming hierarchy used in RefSeq and PGAP annotation. Four families of beta-lactamases are shown (A, metallo, C, and D), each of which is more similar to various hydrolases of other substrates, such as RNA, than to any members of the other beta-lactamase classes. For each class, a protein profile HMM identifies members and suggests a protein product name, but further expansion of the hierarchy can reveal multiple child families, each identified by a more specific HMM that receives a higher precedence during annotation. The hierarchy of evidence largely follows an implicit hierarchy of protein names, with exceptions necessary occasionally, as when unrelated proteins perform closely related functions.
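
The precedence idea in the caption, where a more specific child HMM outranks its parent when naming a protein, can be sketched as follows. This is a minimal illustration under stated assumptions: the model identifiers and product names are hypothetical placeholders, and depth in the hierarchy is used as a stand-in for precedence. The actual PGAP precedence rules are not described here and may differ.

```python
# Illustrative sketch only; model IDs and names below are hypothetical.
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Model:
    model_id: str
    product_name: str
    parent: Optional[str] = None  # None marks a root of the hierarchy


def depth(model_id: str, models: Dict[str, Model]) -> int:
    """Count the ancestors of a model; deeper means more specific."""
    d = 0
    while models[model_id].parent is not None:
        model_id = models[model_id].parent
        d += 1
    return d


def pick_name(hit_ids: Iterable[str], models: Dict[str, Model]) -> Optional[str]:
    """Return the product name of the most specific model among the hits.

    Assumption: precedence is approximated by depth in the hierarchy.
    """
    hits = list(hit_ids)
    if not hits:
        return None
    best = max(hits, key=lambda m: depth(m, models))
    return models[best].product_name


# Hypothetical three-level hierarchy for demonstration.
MODELS = {
    "root": Model("root", "hydrolase family protein"),
    "class_A": Model("class_A", "class A beta-lactamase", parent="root"),
    "class_A_child": Model(
        "class_A_child", "example child-family beta-lactamase", parent="class_A"
    ),
}

if __name__ == "__main__":
    print(pick_name(["root", "class_A", "class_A_child"], MODELS))
```

A protein hitting both the class-level model and a child model would take the child's name, which mirrors the behavior the caption describes.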

Similar articles

  • RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes.
    Haft DH, Badretdin A, Coulouris G, DiCuccio M, Durkin AS, Jovenitti E, Li W, Mersha M, O'Neill KR, Virothaisakun J, Thibaud-Nissen F. Nucleic Acids Res. 2024 Jan 5;52(D1):D762-D769. doi: 10.1093/nar/gkad988. PMID: 37962425.
  • Genome annotation of disease-causing microorganisms.
    Dong Y, Li C, Kim K, Cui L, Liu X. Brief Bioinform. 2021 Mar 22;22(2):845-854. doi: 10.1093/bib/bbab004. PMID: 33537706. Review.
  • RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.
    Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F. Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028. doi: 10.1093/nar/gkaa1105. PMID: 33270901.
  • NCBI Taxonomy: a comprehensive update on curation, resources and tools.
    Schoch CL, Ciufo S, Domrachev M, Hotton CL, Kannan S, Khovanskaya R, Leipe D, Mcveigh R, O'Neill K, Robbertse B, Sharma S, Soussov V, Sullivan JP, Sun L, Turner S, Karsch-Mizrachi I. Database (Oxford). 2020 Jan 1;2020:baaa062. doi: 10.1093/database/baaa062. PMID: 32761142. Review.
  • Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.
    O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O'Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD, Pruitt KD. Nucleic Acids Res. 2016 Jan 4;44(D1):D733-45. doi: 10.1093/nar/gkv1189. Epub 2015 Nov 8. PMID: 26553804.
