The National Center for Biotechnology Information's Protein Clusters Database

Nucleic Acids Res. 2009 Jan;37(Database issue):D216-23.

doi: 10.1093/nar/gkn734. Epub 2008 Oct 21.

William Klimke¹, Richa Agarwala, Azat Badretdin, Slava Chetvernin, Stacy Ciufo, Boris Fedorov, Boris Kiryutin, Kathleen O'Neill, Wolfgang Resch, Sergei Resenchuk, Susan Schafer, Igor Tolstoy, Tatiana Tatusova

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA. klimke@ncbi.nlm.nih.gov

PMID: 18940865
PMCID: PMC2686591
DOI: 10.1093/nar/gkn734

Abstract

Rapid increases in DNA sequencing capabilities have led to a vast increase in the data generated from prokaryotic genomic studies, which has been a boon to scientists studying micro-organism evolution and to those who wish to understand the biological underpinnings of microbial systems. The NCBI Protein Clusters Database (ProtClustDB) has been created to efficiently maintain and keep the deluge of data up to date. ProtClustDB contains both curated and uncurated clusters of proteins grouped by sequence similarity. The May 2008 release contains a total of 285 386 clusters derived from over 1.7 million proteins encoded by 3806 nt sequences from the RefSeq collection of complete chromosomes and plasmids from four major groups: prokaryotes, bacteriophages and the mitochondrial and chloroplast organelles. There are 7180 clusters containing 376 513 proteins with curated gene and protein functional annotation. PubMed identifiers and external cross references are collected for all clusters and provide additional information resources. A suite of web tools is available to explore more detailed information, such as multiple alignments, phylogenetic trees and genomic neighborhoods. ProtClustDB provides an efficient method to aggregate gene and protein annotation for researchers and is available at http://www.ncbi.nlm.nih.gov/sites/entrez?db=proteinclusters.


Figures

**Figure 1.**
Cluster overview display.
(A) Overview of one of the curated elongation factor Tu clusters (PRK12735). All expandable panels are marked with an arrowhead.
(A1) Cluster accession, curation status and protein name, either curated or automatically chosen from existing names for uncurated clusters. Curated gene names would appear at the right.
(A2) The cluster info panel includes basic statistics for the cluster, including protein, paralog, genera and publication counts.
(A3) Cluster tool panel for launching separate analysis tools (shown in detail in Figure 2).
(A4) Cross-references to NCBI and external databases from both curated and automatically collected information. NCBI links include references to the COG, conserved domain (CDD), structure (MMDB) and other Entrez databases (collapsed in the current view: gene, protein, nucleotide, genome, PubMed and taxonomy). External links are described in the text. When a category has more than one link, clicking on that category shows the full list, from which a single link can be chosen.
(A5) Curated functional descriptions, domain description from NCBI CDD, COG functional category and KEGG BRITE hierarchy.
(A6) Publication categories. The publications are available as a link to PubMed, either as the full set or for each subset separately. Publications may occur in multiple categories.
(A7) The related clusters section shows up to 10 related curated and uncurated clusters from all four cluster groups. The full nonredundant set is available from the link showing the total number of related clusters.
(A8) Top cluster pattern. The pattern tool collects patterns of conserved clusters (present in at least three genomes), with the most conserved pattern displayed on the overview page. All patterns are available by clicking on the image and from the cluster tool (Figure 2D).

(B) Protein table for curated cluster PRK05506 (bifunctional sulfate adenylyltransferase subunit 1/adenylylsulfate kinase protein). The list of proteins is displayed below the cluster overview.
(B1) Column headers. This section includes tools to control the list of proteins, such as collapsing all organism groups (this can also be done individually for each group). Paralogs (two or more proteins encoded by the same nucleotide sequence) can be highlighted in yellow, or the entire protein table can be limited to paralogs only.
(B2) List of organism groups and organisms. Checkboxes select groups or individual proteins, and the selection can be broadcast to highlight those proteins in the cluster tool displays (Figure 2). Two proteins from *Frankia* genomes have been selected in order to highlight them in the alignment tool (Figure 2A).
(B3) The list of current protein names reflects the current set of names from RefSeq proteins. Once all proteins in a cluster are updated with the curated name, all protein names will be the same (as they are in this image).
(B4) Protein RefSeq Accession Number and local genomic neighborhood. For each gene encoding a protein in the current cluster, the flanking genes both upstream and downstream in each genome are checked for cluster assignment. Genes in a cluster are shown with that cluster accession; clusters with a COG association are color-coded by functional category. Unclustered genes, RNA genes and pseudogenes are not shown at all. This provides a quick snapshot of the local genomic neighborhood for each gene in the cluster. In this image, all upstream genes encode proteins that belong to curated cluster PRK05253 (sulfate adenylyltransferase subunit 2).
(B5) Links to Entrez Gene by locus tag (unique gene identifier), the protein length and BLink results for each protein [BLAST link: pre-computed BLAST results for proteins; blue diamond; (24)].
(B6) Alignment schematic. Aligned regions are shown as shaded gray bars, with domain information drawn as color-coded bars below each protein (the color is randomly chosen). Sequences that are absolutely identical to each other are framed with a box.
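
The paralog definition used in the protein table (two or more proteins in a cluster encoded by the same nucleotide sequence) can be illustrated with a short Python sketch. This is only an illustration of the definition given in the caption, not ProtClustDB's implementation; the function name `find_paralogs` and all accessions shown are hypothetical.

```python
from collections import defaultdict


def find_paralogs(cluster_proteins):
    """Group the proteins of one cluster by the nucleotide sequence encoding them.

    cluster_proteins: iterable of (protein_accession, nucleotide_accession)
    pairs that all belong to the same cluster.
    Returns {nucleotide_accession: [protein_accession, ...]} restricted to
    nucleotide sequences that encode two or more proteins of the cluster.
    """
    by_nucleotide = defaultdict(list)
    for protein_acc, nucleotide_acc in cluster_proteins:
        by_nucleotide[nucleotide_acc].append(protein_acc)
    # Keep only nucleotide sequences contributing at least two proteins.
    return {nuc: prots for nuc, prots in by_nucleotide.items() if len(prots) >= 2}


# Hypothetical accessions, for illustration only.
example = [
    ("YP_000001", "NC_000001"),
    ("YP_000002", "NC_000001"),
    ("YP_000003", "NC_000002"),
]
paralogs = find_paralogs(example)
print(paralogs)
```

Limiting a table to "paralogs only" would then amount to keeping just the proteins listed in the returned dictionary.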

**Figure 2.**
Cluster tools.
(A) Detailed multiple alignment view for cluster PRK05506 (bifunctional sulfate adenylyltransferase subunit 1/adenylylsulfate kinase protein; Figure 1B). The detailed alignment view can display the alignment color-coded by conserved amino acid property, which highlights residues at 80% or greater in the following redundant groups: aromatic (FHWY); aliphatic (ILVA); hydrophobic (ACFILMVWY); alcohol-containing (STC); charged (DEHKR); positive (HKR); negative (DE); polar (CDEHKNQRST); tiny (AGS); small (ACDGNPSTV); or bulky (EFIKLMQRWY); or by consensus mode as shown in the next panel.
(B) The top panel includes information and controls for the alignment as well as a download button (FASTA + gap). Domains and features aligned against each protein (drawn as colored bars under the protein sequence) are from CDD. In this example, two domains are displayed in the alignment, drawn as colored boxes below the sequence for the two highlighted proteins from *Frankia*: cd04095 (domain II of ATP sulfurylase), brown on the left, and cd0207 (adenosine 5′-phosphosulfate kinase), blue on the right, with a ligand-binding site in the feature row above the protein sequences.
(C) Phylogenetic tree for PRK12351 (methylcitrate synthase). At the top is the toolbar with information and controls for distance method, tree construction method and the collapse level (by taxonomic rank). Below is the tree, which in this image has been rerooted, showing archaeal proteins highlighted in red (in this case from checkboxes in the protein table for this cluster) and expanded to show every leaf. The tree can be transformed by clicking on the tree itself (reroot, squeeze, collapse and expand).
(D) Cluster pattern view for PRK05325 (hypothetical protein). The pattern tool allows for exploration of conserved gene neighborhoods. Whereas the protein table and ProtMap show the complete genomic region around each gene encoding a protein in a cluster, the pattern tool collects conserved patterns that occur in three or more genomes, in a maximum window of 40 genes upstream or downstream. The most conserved pattern is shown at the top (and on the overview page, Figure 1A8). The table to the left of the patterns shows the number of conserved proteins (the number of sequences contributing to the same pattern, which may come from the same nucleotide sequence if present as paralogs in the same cluster), the number of clusters in the conserved pattern and the common taxonomic node. The pattern itself shows all clusters in each pattern and is pseudo-aligned, with the same cluster in each row aligned. Clusters are color-coded according to COG functional categories, and the accession is linked to the cluster, the cluster pattern or the ProtMap for that particular cluster. Gray boxes indicate an insertion into the pseudo-alignment for alignment purposes and do not reflect a cluster (gene/protein) at that position in the genome. Unlike the arrows in ProtMap, the size of each box is not proportional to the size of the gene. The gene neighborhood around the genes encoding the hypothetical proteins for PRK05325 (no function yet determined) shows conservation of association with putative serine protein kinases (the yellow category, apparently involved in signal transduction; a set of uncurated clusters encoded by genes 5′ of the genes encoding proteins in PRK05325).
(E) Limited ProtMap view for PRK08568 (preprotein translocase subunit SecY). The ProtMap view shows the full gene neighborhood in a limited horizontal window, unlike the cluster pattern tool, which shows a more condensed and taxonomically conserved view of the same information but with a potentially wider window. Note that the genes are drawn to scale in this view. In this example, the *Methanococcus* spp. RefSeq Nucleotide Accession Numbers are highlighted in yellow on the left to show that the *secY* gene (cluster PRK08568) is found upstream of a glycosyl transferase encoding gene (CLS1191473, color-coded yellow for cell wall biogenesis), whereas in most other organisms *secY* is upstream of adenylate kinase (PRK04040, colored blue for nucleotide transport and metabolism). Note that PRK04040 contains a large set of contributing sequences that are not shown in the image for brevity. The pattern tool can be used to control the display of the ProtMap, directing the display to show only the ProtMap for a particular pattern.
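
The property-based coloring described for the detailed alignment view (panel A) can be sketched as follows. The residue groups and the 80% threshold are taken from the caption; everything else is an assumption made for illustration: the function names are hypothetical, gap characters are counted as non-members of every group, and all qualifying groups are reported because the caption does not say how the tool chooses among overlapping groups.

```python
# Residue groups as listed in the Figure 2A caption.
PROPERTY_GROUPS = {
    "aromatic": "FHWY",
    "aliphatic": "ILVA",
    "hydrophobic": "ACFILMVWY",
    "alcohol-containing": "STC",
    "charged": "DEHKR",
    "positive": "HKR",
    "negative": "DE",
    "polar": "CDEHKNQRST",
    "tiny": "AGS",
    "small": "ACDGNPSTV",
    "bulky": "EFIKLMQRWY",
}


def conserved_properties(column, threshold_percent=80):
    """Return the groups whose members make up >= threshold_percent of a column.

    Assumption: gaps ('-') and unknown symbols count toward the column size
    but belong to no group.
    """
    column = column.upper()
    if not column:
        return []
    hits = []
    for name, members in PROPERTY_GROUPS.items():
        count = sum(1 for residue in column if residue in members)
        # Integer comparison avoids floating-point rounding at the threshold.
        if count * 100 >= threshold_percent * len(column):
            hits.append(name)
    return hits


def annotate_alignment(rows, threshold_percent=80):
    """Apply conserved_properties to every column of an aligned set of rows."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return [
        conserved_properties("".join(col), threshold_percent)
        for col in zip(*rows)
    ]


# Toy alignment, for illustration only.
for position, groups in enumerate(annotate_alignment(["FD", "YD", "WE"]), start=1):
    print(position, groups)
```

Because the groups overlap, a single column usually satisfies several of them at once; a real viewer would need a precedence rule to pick one color, which is not described in the caption.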


Cited by

Mixed waste contamination selects for a mobile genetic element population enriched in multiple heavy metal resistance genes.
Goff JL, Lui LM, Nielsen TN, Poole FL, Smith HJ, Walker KF, Hazen TC, Fields MW, Arkin AP, Adams MWW. ISME Commun. 2024 May 9;4(1):ycae064. doi: 10.1093/ismeco/ycae064. eCollection 2024 Jan. PMID: 38800128. Free PMC article.
Streptococcus taonis sp. nov., a novel bacterial species isolated from a blood culture of a patient.
Lee CY, Chan CK, Chida M, Miyashita M, Lee YS, Wu HC, Chang YC, Lin WT, Chen YS. Arch Microbiol. 2024 Mar 15;206(4):168. doi: 10.1007/s00203-024-03884-x. PMID: 38489085.
GDPF: a data resource for the distribution of prokaryotic protein families across the global biosphere.
Pan Z, Li DD, Li P, Geng Y, Jiang Y, Liu Y, Li YZ, Zhang Z. Nucleic Acids Res. 2024 Jan 5;52(D1):D724-D731. doi: 10.1093/nar/gkad869. PMID: 37823598. Free PMC article.
Draft Genome Sequences of 17 Campylobacter coli Strains Isolated from Animal and Food Sources in Brazil.
Gomes CN, Duque SDS, Balkey M, Allard MW, Falcão JP. Microbiol Resour Announc. 2023 Jul 18;12(7):e0031223. doi: 10.1128/mra.00312-23. Epub 2023 Jun 12. PMID: 37306576. Free PMC article.
The conserved domain database in 2023.
Wang J, Chitsaz F, Derbyshire MK, Gonzales NR, Gwadz M, Lu S, Marchler GH, Song JS, Thanki N, Yamashita RA, Yang M, Zhang D, Zheng C, Lanczycki CJ, Marchler-Bauer A. Nucleic Acids Res. 2023 Jan 6;51(D1):D384-D388. doi: 10.1093/nar/gkac1096. PMID: 36477806. Free PMC article.


References

1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269:496–512.
2. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–637.
3. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41.
4. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007;35:D5–D12.
5. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’Donovan C, Phan I, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–370.

