RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation
- PMID: 33270901
- PMCID: PMC7779008
- DOI: 10.1093/nar/gkaa1105
Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
Published by Oxford University Press on behalf of Nucleic Acids Research 2020.
Figures
![Figure 1.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/7779008/bin/gkaa1105fig1.gif)
![Figure 2.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/7779008/bin/gkaa1105fig2.gif)
![Figure 3.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/7779008/bin/gkaa1105fig3.gif)
![Figure 4.](https://www.ncbi.nlm.nih.gov/pmc/articles/instance/7779008/bin/gkaa1105fig4.gif)
Similar articles
- RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes. Nucleic Acids Res. 2024 Jan 5;52(D1):D762-D769. doi: 10.1093/nar/gkad988. PMID: 37962425 Free PMC article.
- An Experimental Approach to Genome Annotation: This report is based on a colloquium sponsored by the American Academy of Microbiology held July 19-20, 2004, in Washington, DC. Washington (DC): American Society for Microbiology; 2004. PMID: 33001599 Free Books & Documents. Review.
- RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860. doi: 10.1093/nar/gkx1068. PMID: 29112715 Free PMC article.
- Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016 Jan 4;44(D1):D733-45. doi: 10.1093/nar/gkv1189. Epub 2015 Nov 8. PMID: 26553804 Free PMC article.
- Annotation of bacterial and archaeal genomes: improving accuracy and consistency. Chem Rev. 2007 Aug;107(8):3431-47. doi: 10.1021/cr068308h. Epub 2007 Jul 21. PMID: 17658903 Review. No abstract available.
Cited by
- A novel replication initiation region encoded in a widespread Acinetobacter plasmid lineage carrying a blaNDM-1 gene. PLoS One. 2024 May 31;19(5):e0303976. doi: 10.1371/journal.pone.0303976. eCollection 2024. PMID: 38820537 Free PMC article.
- A cascade of sulfur transferases delivers sulfur to the sulfur-oxidizing heterodisulfide reductase-like complex. Protein Sci. 2024 Jun;33(6):e5014. doi: 10.1002/pro.5014. PMID: 38747384 Free PMC article.
- Genome sequence of the bialaphos producer Streptomyces sp. DSM 41527 and two putative phosphonate antibiotic producers Streptomyces sp. DSM 41014 and DSM 41981 from the DSMZ strain collection. Access Microbiol. 2024 Apr 19;6(4):000770.v3. doi: 10.1099/acmi.0.000770.v3. eCollection 2024. PMID: 38737806 Free PMC article.
- Loss to gain: pseudogenes in microorganisms, focusing on eubacteria, and their biological significance. Appl Microbiol Biotechnol. 2024 May 8;108(1):328. doi: 10.1007/s00253-023-12971-w. PMID: 38717672 Free PMC article. Review.
- Persistent homology reveals strong phylogenetic signal in 3D protein structures. PNAS Nexus. 2024 Apr 17;3(4):pgae158. doi: 10.1093/pnasnexus/pgae158. eCollection 2024 Apr. PMID: 38689707 Free PMC article.