Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028.
doi: 10.1093/nar/gkaa1105.

RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation

Wenjun Li et al. Nucleic Acids Res. 2021.

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
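
The naming scheme summarized in the abstract, in which a protein takes its name and attributes from a matching protein family model, can be illustrated with a minimal sketch. This is not PGAP's implementation: the `FamilyModel` class, the `name_protein` function, the numeric `precedence` values, and all accessions, names, and gene symbols below are hypothetical and serve only to show the idea of choosing the highest-precedence match and inheriting its attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FamilyModel:
    """A hypothetical stand-in for one protein family model (PFM)."""
    accession: str                     # made-up identifier, not a real accession
    kind: str                          # "HMM", "BlastRule" or "CDD architecture"
    product_name: str
    precedence: int                    # assumption: a larger value wins
    gene_symbol: Optional[str] = None
    ec_number: Optional[str] = None


def name_protein(hits):
    """Return the name and attributes inherited from the best-ranked hit.

    Returns None when the protein matched no model at all.
    """
    if not hits:
        return None
    best = max(hits, key=lambda model: model.precedence)
    return {
        "product": best.product_name,
        "gene_symbol": best.gene_symbol,
        "ec_number": best.ec_number,
        "evidence": best.accession,
    }


if __name__ == "__main__":
    # Two hypothetical models that both hit the same protein.
    hits = [
        FamilyModel("HYP_CDD_1", "CDD architecture",
                    "transporter family protein", precedence=10),
        FamilyModel("HYP_HMM_1", "HMM",
                    "example efflux transporter", precedence=50,
                    gene_symbol="exaA"),
    ]
    print(name_protein(hits))
```

In this sketch the lower-ranked match is simply ignored for naming; the Figure 2 caption indicates that the real database still lists proteins hit by a model but named after higher-precedence evidence.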

Figures

Figure 1.
TCDB-derived evidence (NBR011040, NBR011041, NBR011042 and NF038175) improves the annotation of three proteins (WP_003900124.1, WP_042507723.1, WP_003401747.1 and WP_003401737.1, respectively) from the iniBAC operon in Mycobacterium tuberculosis CDC1551 genome (NC_002755.2: 408694–414298). The new evidence was built from TCDB and used to name RefSeq proteins.
Figure 2.
Example record for HMM NF033727.1. (A) Description of the function of proteins included in the family defined by the model. (B) Attributes and characteristics of the model, including the name that is propagated to protein products named by PGAP based on the model. (C) Download options for the profile and the seed proteins. (D) Publications supporting the definition of the model and its functional assignment. (E) RefSeq proteins hit by the model. (F) One tab lists the hits for which NF033727.1 is the highest-precedence evidence (shown), and another tab lists the proteins that are hit by the model but named after higher-precedence evidence (not shown). (G) Menu of possible actions for proteins selected on the left.
Figure 3.
Increase over time of RefSeq non-redundant proteins named after a protein family model. The stacked bars (values on the left axis) indicate the proportion of proteins named by HMMs (orange), CDD architectures (gray), BlastRules (yellow), or BLAST hits to a cluster representative protein (blue). The blue line (values on the right axis) represents the growth in the total number of prokaryotic RefSeq proteins.
Figure 4.
Example comment block on a non-redundant RefSeq protein named after HMM NF033727.1.
