NCBI prokaryotic genome annotation pipeline

doi:10.1093/nar/gkw569

Nucleic Acids Res. 2016 Aug 19;44(14):6614-24.

doi: 10.1093/nar/gkw569. Epub 2016 Jun 24.

Affiliations

¹ National Center for Biotechnology Information, U.S. National Library of Medicine, Bethesda, MD 20894, USA.
² Wallace H. Coulter Department of Biomedical Engineering, Georgia Tech, Atlanta, GA 30332, USA.
³ Wallace H. Coulter Department of Biomedical Engineering, Georgia Tech, Atlanta, GA 30332, USA; School of Computational Science and Engineering, Georgia Tech, Atlanta, GA 30332, USA. borodovsky@gatech.edu

PMID: 27342282
PMCID: PMC5001611
DOI: 10.1093/nar/gkw569

Tatiana Tatusova et al. Nucleic Acids Res. 2016.

Abstract

Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, NCBI's new Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, and more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation across the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/.

Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
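
The trade-off described in the abstract (lean on homology when confident comparative data exist, fall back on statistical prediction otherwise) can be illustrated with a toy sketch. This is not PGAP or GeneMarkS+ code: GeneMarkS+ integrates evidence into the prediction itself rather than picking between two finished models, and the `GeneModel` type, the `choose_model` function, and the 0.9 coverage cutoff are all hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneModel:
    start: int   # 1-based start coordinate on the genome
    end: int     # inclusive end coordinate
    source: str  # "homology" or "ab_initio" (illustrative labels)


def choose_model(ab_initio: GeneModel,
                 homology: Optional[GeneModel],
                 coverage: float,
                 min_coverage: float = 0.9) -> GeneModel:
    """Prefer the homology-derived model when its alignment is confident.

    `coverage` is the fraction of the reference protein covered by the
    alignment; `min_coverage` is an assumed cutoff, not a PGAP parameter.
    """
    if homology is not None and coverage >= min_coverage:
        return homology
    # No external evidence, or evidence too weak: keep the statistical call.
    return ab_initio


predicted = GeneModel(100, 400, "ab_initio")
aligned = GeneModel(130, 400, "homology")
print(choose_model(predicted, aligned, coverage=0.95).source)
print(choose_model(predicted, None, coverage=0.0).source)
```

In this simplified view, the first call returns the homology-supported model and the second falls back to the ab initio prediction.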

Figures

**Figure 1.**
The cumulative number of protein clusters (Y) is defined, for a given X (%), as the number of clusters that contain proteins from a fraction x ≥ X of all members of the clade. Data are presented for four well-studied clades.

**Figure 2.**
A fragment of the PGAP execution graph: prediction of structural RNA genes (ncRNA, tRNA, 5S-, 16S-, 23S- rRNA).

**Figure 3.**
Flowchart of PGAP. The red dotted line indicates separation between pass one and pass two (see text for details).

**Figure 4.**
A region in the *Deinococcus radiodurans* R1 genome assembly (GCA_000008565.1) contains three overlapping ORFs predicted ab initio as CDSs in the first pass of PGAP. Automatic evaluation of the cross-species protein evidence through the second pass of PGAP reveals proteins bearing homology to all three fragments. Alignment of the proteins to the genome reveals otherwise unpredicted frameshifts. Green bars represent genes, red bars – coding regions; grey bars – alignments with red vertical bars indicating mismatches. (A) A region of Chromosome 1 of *D. radiodurans* (AE000513.1) containing the three CDS features is displayed alongside the six-frame translation. (B) The same region, updated to include final annotation markup with a frameshifted CDS as well as supporting proteins that demonstrate a consistent pattern and location of two frameshifts (marked by arrows at positions 100 733 and 100 959).

**Figure 5.**
Annotation of the genome of *Salmonella enterica* subsp. *enterica* serovar *Typhimurium* str. LT2 (NC_003197). Protein alignment provides support for gene start selection. See the legend to Figure 4 for the meaning of the green, red and grey bars. (A) The first round of alignments of protein representatives from the ‘core’ protein clusters does not give enough evidence for gene start selection. (B) The second round of alignments clearly supports a shorter gene model that does not overlap with the upstream gene.

**Figure 6.**
A summary of the PGAP genome annotation process is provided in the COMMENT section of GenBank and RefSeq records. The example shown is for *Listeria monocytogenes* strain CFSAN010068, complete genome NZ_CP014250.1.

**Figure 7.**
Frequency histogram of genomes with respect to the fraction of the whole complement of genes supported by similarity to proteins in RefSeq. In about 50% of the total set of genomes under consideration, mostly from highly populated clades, more than 95% of protein-coding genes are supported by protein sequence similarity.
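
The quantity plotted in Figure 1 can be computed directly from its definition. The sketch below is a minimal illustration under stated assumptions: the input is a mapping from cluster identifier to the genomes that contribute at least one protein to that cluster, and the function name `cumulative_cluster_counts`, the cluster names, and the genome labels are all hypothetical.

```python
from typing import Dict, Iterable, List


def cumulative_cluster_counts(cluster_genomes: Dict[str, Iterable[str]],
                              clade_size: int,
                              thresholds: List[float]) -> Dict[float, int]:
    """For each threshold X (%), count clusters present in >= X% of the clade."""
    # x for each cluster: percentage of clade members with a protein in it.
    fractions = [100.0 * len(set(genomes)) / clade_size
                 for genomes in cluster_genomes.values()]
    return {X: sum(1 for x in fractions if x >= X) for X in thresholds}


# Toy clade of four genomes and three clusters (made-up data).
clusters = {
    "cluster_a": ["g1", "g2", "g3", "g4"],  # present in 100% of genomes
    "cluster_b": ["g1", "g2", "g3"],        # present in 75%
    "cluster_c": ["g1"],                    # present in 25%
}
print(cumulative_cluster_counts(clusters, clade_size=4,
                                thresholds=[25, 50, 80, 100]))
# → {25: 3, 50: 2, 80: 1, 100: 1}
```

Because the count uses x ≥ X, the curve is non-increasing in X, and its value at X = 100 is the number of clusters found in every member of the clade.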

Cited by

Genomic analysis of a halophilic bacterium Nesterenkonia sp. CL21 with ability to produce a diverse group of lignocellulolytic enzymes.
An H, Ching XH, Cheah WJ, Lim WL, Ee KY, Chong CS, Lam MQ. Folia Microbiol (Praha). 2024 Jun 6. doi: 10.1007/s12223-024-01178-9. Online ahead of print. PMID: 38842626
Roseateles caseinilyticus sp. nov. and Roseateles cellulosilyticus sp. nov., isolated from rice paddy field soil.
So Y, Chhetri G, Kim I, Park S, Jung Y, Seo T. Antonie Van Leeuwenhoek. 2024 Jun 4;117(1):87. doi: 10.1007/s10482-024-01988-4. PMID: 38833203
Comparative Analyses of Bacteriophage Genomes.
Rossi FPN, Flores VS, Uceda-Campos G, Amgarten DE, Setubal JC, da Silva AM. Methods Mol Biol. 2024;2802:427-453. doi: 10.1007/978-1-0716-3838-5_14. PMID: 38819567
Annotation and Comparative Genomics of Prokaryotic Transposable Elements.
Ross K, Zerillo MM, Chandler M, Varani AM. Methods Mol Biol. 2024;2802:189-213. doi: 10.1007/978-1-0716-3838-5_8. PMID: 38819561
How to Obtain and Compare Metagenome-Assembled Genomes.
Sanchez FB, Sato Guima SE, Setubal JC. Methods Mol Biol. 2024;2802:135-163. doi: 10.1007/978-1-0716-3838-5_6. PMID: 38819559

Grants and funding

R01 HG000783/HG/NHGRI NIH HHS/United States
