Yanbin Yin, Xizeng Mao, Jincai Yang, Xin Chen, Fenglou Mao, Ying Xu, dbCAN: a web resource for automated carbohydrate-active enzyme annotation, Nucleic Acids Research, Volume 40, Issue W1, 1 July 2012, Pages W445–W451, https://doi.org/10.1093/nar/gks479
Abstract
Carbohydrate-active enzymes (CAZymes) are very important to the biotech industry, particularly the emerging biofuel industry because CAZymes are responsible for the synthesis, degradation and modification of all the carbohydrates on Earth. We have developed a web resource, dbCAN (http://csbl.bmb.uga.edu/dbCAN/annotate.php), to provide a capability for automated CAZyme signature domain-based annotation for any given protein data set (e.g. proteins from a newly sequenced genome) submitted to our server. To accomplish this, we have explicitly defined a signature domain for every CAZyme family, derived based on the CDD (conserved domain database) search and literature curation. We have also constructed a hidden Markov model to represent the signature domain of each CAZyme family. These CAZyme family-specific HMMs are our key contribution and the foundation for the automated CAZyme annotation.
INTRODUCTION
Carbohydrate-active enzymes (CAZymes), responsible for the synthesis, degradation and modification of all the carbohydrates on Earth, are an important class of proteins, particularly for the biotech industry, such as the biofuel industry. The CAZy database (hereafter CAZyDB; http://www.cazy.org) is currently the most comprehensive database of CAZyme proteins. As of April 2011 it consisted of 308 CAZyme families (excluding nine deprecated ones and five unclassified families, e.g. GT0), grouped into five functional classes: glycoside hydrolases (GHs), glycosyltransferases (GTs), polysaccharide lyases (PLs), carbohydrate esterases (CEs) and the non-catalytic carbohydrate-binding modules (CBMs). CAZyDB is updated every few weeks, mainly to add new families to keep up with the most recent literature. The popularity of the database and its classification scheme is evident from its high citation number (1).
While popular, we see three issues with CAZyDB based on our own experience in using it. First, CAZyDB maintains a list of proteins from GenBank and UniProt belonging to each CAZyme family but does not provide an easy way to query, search or download the sequence, structure and annotation data. Second, the database does not explicitly define the ‘signature domain’ for any of the CAZyme families; so from a user’s perspective, it is unknown what the defining (signature) domain is for each family and where the domain is located in a full-length protein. Last and most importantly, CAZyDB does not provide a way for an automated annotation of the CAZyme members in a given genome, which becomes increasingly needed with more and more genomes and metagenomes being sequenced at an increasing rate.
A common practice now when trying to annotate a genome is to BLAST the genome against the annotated full-length CAZyme proteins in CAZyDB (2–4). Often this does not work well for annotating CAZymes, many of which are multiple-domain proteins, e.g. searching for short CBM regions in GHs. Another approach is to use Pfam models that are associated with CAZyme families for domain-based annotation (4–7). The CAZyme Annotation Toolbox (CAT) (6) falls into this category, which was recently developed to address the automated annotation issue. It combines a BLAST search and a Pfam domain-based search; to extend the Pfam search result, an association rule learning algorithm was used to find the correspondence between Pfam domains and CAZyme families. The main problems with the CAT program include: (i) it did not define a signature domain for each CAZyme, the key information needed for accurate and reliable annotation of CAZyme proteins in an automated fashion and (ii) its Pfam domain-based search covers only 46% (142/308) of the CAZyme families.
For a comprehensive and accurate annotation of the CAZyme families, users often have to contact the developers of CAZyDB for their semi-automatic annotations (1,8–10). This is clearly becoming a bottleneck and is not consistent with the way other popular protein domain/family databases such as Pfam (11), InterPro (12) and CDD (13) handle annotation needs: all of them provide data and automated services through their websites. Clearly, there is an urgent need for an accurate and reliable tool for automated and comprehensive annotation of CAZyme proteins.
To fully address the issues outlined above, we developed a web resource, dbCAN (http://csbl.bmb.uga.edu/dbCAN/), based on the classification scheme of CAZyDB. We aimed to provide a solution for automated CAZyme annotation for any given genome, as well as easy and convenient access to sequences, domain models, alignments and phylogeny data of CAZyme-related enzyme families and functional modules, hence addressing all three issues discussed above. The basis for dbCAN’s automated and comprehensive annotation is the clearly defined signature domain models of all 308 CAZyme families, which are not provided by any existing tools, including CAZyDB and CAT. In addition to the current five CAZyme classes, we also included in dbCAN three additional domain modules: dockerin, cohesin and SLH (S-layer homology domain), which are critical for forming cellulosomes, multi-protein complexes that can efficiently degrade carbohydrate-rich biomass (14).
IDENTIFICATION AND DEFINITION OF SIGNATURE DOMAINS
To define a signature domain for each CAZyme family, we identified an annotated functional domain by referring to the CDD (Conserved Domain Database) (13) search results and the published literature (Figure 1) for the member GenBank proteins in that family. Specifically, we analyzed the CDD search results (by RPS-BLAST) of the member proteins to select a CDD model that matches most of these proteins with significant sequence similarity. The underlying assumption is that proteins of the same CAZyme family must share a common region, which might be represented by some annotated functional domain in the public protein domain databases. Moreover, we manually reviewed the functional description of the top CDD models to ensure that the selected model indeed represents the functional activities of the CAZyme family. For instance, family CE2 was assigned the CDD domain cd01831 (Endoglucanase_E_like) as this domain covers all GenBank proteins in this family with very significant E-values. It is worth noting that some CDD models are redundant with cd01831, e.g. pfam00657 (Lipase_GDSL), cd00229 (SGNH_hydrolase) and COG2755 (TesA, Lysophospholipase L1 and related esterases). Although these models describe different biochemical activities, they all match substantially overlapping regions in the member proteins of CAZyme family CE2.
Figure 1. Flowchart of our procedure for identifying and defining signature domain models for an example CAZyme family. Here this family contains four full-length proteins with different lengths. The red box is the signature domain regions defining the CAZyme family. It could be either identified by searching against annotated functional domain models in the CDD database or retrieved from literature curation. Boxes in other colors are non-overlapped domain regions annotated by other CDD models. The CDD search is done by RPS-BLAST, the multiple sequence alignment is done by MAFFT (default parameters), the building of HMM is done by hmmbuild and all other processes are done by self-developed perl scripts.
We were able to find a CDD model (defined as a position-specific scoring matrix) for 248 CAZyme families out of the total of 308 (Supplementary Data S1). Since CDD is a general protein domain database containing over 40 000 models defined based on the alignment of some seed proteins, the selected CDD models are not exactly CAZyme family specific. In addition, analyses of these CDD models indicate multiple CAZyme families may share the same CDD model. To build CAZyme family-specific models, we first identified the domain regions in the component GenBank proteins of each CAZyme family based on its selected CDD model using hmmsearch (a command in the HMMER 3.0 package, hmmer.org) and then generated a hidden Markov model (HMM, by hmmbuild, hmmer.org) based on the multiple sequence alignment [by MAFFT v6.603b (15)] of the identified CDD domain regions, which gives rise to a unique HMM for each of the 248 families.
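The alignment and model-building steps described above (MAFFT with default parameters, then hmmbuild) can be sketched as follows. This is a minimal illustration and not the authors' scripts, which were written in Perl: the function names and file names are hypothetical, and the input is assumed to be a FASTA file that already contains the extracted signature-domain regions of one family.

```python
import subprocess
from pathlib import Path


def family_hmm_commands(family, domain_fasta, workdir="."):
    """Plan the two external commands needed to build one family HMM.

    family       -- CAZyme family name, e.g. "CE2"
    domain_fasta -- FASTA file of extracted domain regions (assumed input)
    File names below are hypothetical, chosen only for this sketch.
    """
    work = Path(workdir)
    alignment = str(work / f"{family}.aln.fasta")
    hmm = str(work / f"{family}.hmm")
    # MAFFT with default parameters writes the alignment to standard output.
    mafft_cmd = ["mafft", domain_fasta]
    # hmmbuild: -n names the model, --informat afa declares aligned FASTA input.
    hmmbuild_cmd = ["hmmbuild", "-n", family, "--informat", "afa", hmm, alignment]
    return {"alignment": alignment, "hmm": hmm,
            "mafft": mafft_cmd, "hmmbuild": hmmbuild_cmd}


def run_family_build(plan):
    """Execute the planned commands (requires mafft and hmmbuild on PATH)."""
    with open(plan["alignment"], "w") as out:
        subprocess.run(plan["mafft"], stdout=out, check=True)
    subprocess.run(plan["hmmbuild"], check=True)
```

Calling `family_hmm_commands("CE2", "CE2.domains.fasta")` only returns the plan; `run_family_build` is the step that actually invokes the external programs, which keeps the planning logic inspectable without HMMER or MAFFT installed.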
The other 60 CAZyme families did not have a CDD model since no model covers the majority (80%) of the component GenBank proteins for each of these families (Supplementary Data S1). For 20 of them (including 15 CBM families, Supplementary Data S2), we were able to identify an initial signature domain for some characterized GenBank proteins in each family through manual curation of the published literature; we then populated the domain regions by retrieving them (by BLASTP) from all component proteins of the family and finally we were able to build an HMM specifically for the family using the aforementioned procedure (i.e. MAFFT + HMMER). For the remaining 40 families (mostly small and non-CBM families), CDD and literature search did not provide any signature domain information. For each such family, we generated a multiple sequence alignment (by MAFFT) among all component full-length GenBank proteins and then manually edited the alignment by removing long gaps and ambiguously aligned regions. Based on these carefully edited alignments, we then built an HMM (hmmbuild) for each of these families to represent its signature domain.
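The rule for deciding whether a family has a usable CDD model (a model must cover the majority, 80%, of the family's GenBank proteins) can be illustrated with a small sketch. The data structure and function name are hypothetical, the threshold is treated as "at least 80%" by assumption, and the manual review of each model's functional description is not captured here.

```python
from collections import Counter


def select_cdd_model(hits_by_protein, min_coverage=0.8):
    """Pick the CDD model matching the largest share of a family's proteins.

    hits_by_protein -- dict mapping protein ID to the set of CDD model IDs
                       with a significant RPS-BLAST hit (assumed input).
    Returns (model_id, coverage), or None when no model reaches min_coverage.
    """
    total = len(hits_by_protein)
    if total == 0:
        return None
    counts = Counter()
    for models in hits_by_protein.values():
        counts.update(set(models))  # count each model once per protein
    if not counts:
        return None
    # Highest protein count first; ties broken by model ID for determinism.
    best = min(counts, key=lambda m: (-counts[m], m))
    coverage = counts[best] / total
    return (best, coverage) if coverage >= min_coverage else None
```

A family for which this returns `None` would fall into the group of 60 families handled by literature curation or by editing a full-length alignment.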
Overall we were able to generate a unique and family-specific signature HMM for each of the 308 CAZyme families. Using these HMMs to search against the CAZyme component (GenBank) proteins, we were able to correctly identify at least 95% of the component proteins from each of the 308 CAZyme families (Supplementary Data S1).
EVALUATION OF ANNOTATION ACCURACY
With the signature domain HMMs available, we are now able to perform hmmscan of any given protein data set against the 308 dbCAN HMMs for an automated CAZyme annotation.
[Equations: five formula images in the original article; their content is not recoverable from this copy.]
We regarded all CAZyme proteins of the two genomes annotated by CAZyDB as true positives. Assuming that the list of CAZyme proteins annotated by CAZyDB is accurate and complete, our automated annotation achieved its best overall performance for C. thermocellum (sensitivity = 99.3% and PPV = 89.4%) with alignment coverage >0.5 as the threshold, and for A. thaliana (sensitivity = 96.3% and PPV = 78.8%) with alignment coverage >0.3 as the threshold (Supplementary Data S9).
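For readers who want to reproduce this kind of assessment, the sketch below computes sensitivity and positive predictive value (PPV) from a predicted protein set and a reference set. It uses the standard definitions, sensitivity = TP/(TP + FN) and PPV = TP/(TP + FP), which are assumed to correspond to the measures reported here; the function name is hypothetical.

```python
def sensitivity_ppv(predicted, reference):
    """Compare predicted CAZyme protein IDs with a reference annotation.

    predicted -- IDs reported by the automated annotation
    reference -- IDs annotated by the reference (here, CAZyDB)
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # found by both
    fp = len(predicted - reference)   # predicted only
    fn = len(reference - predicted)   # reference only (missed)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "PPV": ppv}
```

Because the reference set is taken as ground truth, any real CAZyme absent from it is counted as a false positive and lowers the PPV.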
We also performed a similar assessment using a set of rebuilt HMMs without including any information from the two genomes C. thermocellum and A. thaliana that we were testing against. Specifically, we removed all the proteins of the two genomes from the list of all 308 CAZyme families and rebuilt the HMMs based on this reduced protein list. The performance of the new HMMs is as follows: sensitivity = 98.6% and PPV = 86.1% for C. thermocellum and sensitivity = 95.6% and PPV = 76.6% for A. thaliana (Supplementary Data S10). While the performance dropped slightly, the results indicate the robustness of dbCAN’s HMMs for CAZyme annotation.
The detailed comparison results in terms of the TP, TN and FP values for the two genomes are summarized in Supplementary Data S3–S8. We identified most CAZymes from the two organisms, but also included many FP proteins. However, some of these ‘FP’ proteins may be real CAZyme proteins missed by CAZyDB, since many of them have very significant E-values against the CAZyme family HMMs. For example, Cthe_1186 of C. thermocellum matched the CE10 family HMM with an E-value of 1.20e-64 and AT1G29660.1 of A. thaliana matched the CE16 HMM with an E-value of 1.6e-28 (see Supplementary Data S11 for the alignment). Whether these proteins are true CAZymes can be established only through experimental studies.
COMPARISON WITH BLAST-BASED AND CDD-BASED SEARCH STRATEGIES
We also compared our HMM-based annotation with two other commonly used annotation strategies. The first uses BLASTP to search the proteins of C. thermocellum and A. thaliana against CAZyDB (after excluding the proteins of the two genomes). As with the domain-based hmmscan search, we processed the BLAST search results by considering two parameters: E-value and bit-score. Specifically, we used the same E-value cutoffs as above and then tried different bit-score cutoffs to parse the BLAST outputs. Supplementary Data S12 shows that for C. thermocellum a bit-score cutoff of >425 gave the most balanced performance (sensitivity = 92.4%, PPV = 96.4% and average of the two = 94.4%) and that for A. thaliana a bit-score cutoff of >350 gave the most balanced performance (sensitivity = 78.8%, PPV = 66.7% and average = 72.7%). These numbers are similar to dbCAN’s performance for C. thermocellum (sensitivity = 99.3%, PPV = 89.4% and average = 94.3%) but much worse than dbCAN’s for A. thaliana (sensitivity = 96.3%, PPV = 78.8% and average = 87.6%).
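The cutoff scan described above can be sketched as follows, assuming that "most balanced" means the highest average of sensitivity and PPV, which is how the figures are reported here. The function name and input layout are hypothetical; each protein is represented by the bit-score of its best BLASTP hit that already passed the E-value cutoff.

```python
def best_bitscore_cutoff(best_hit_scores, reference, cutoffs):
    """Scan bit-score cutoffs and keep the one with the best average.

    best_hit_scores -- dict: protein ID -> best BLASTP bit-score (assumed input)
    reference       -- protein IDs annotated as CAZymes by the reference
    cutoffs         -- iterable of candidate bit-score cutoffs
    """
    reference = set(reference)
    best = None
    for cutoff in cutoffs:
        predicted = {p for p, s in best_hit_scores.items() if s > cutoff}
        tp = len(predicted & reference)
        if tp == 0 or not reference:
            continue  # sensitivity and PPV are both uninformative here
        sensitivity = tp / len(reference)
        ppv = tp / len(predicted)
        average = (sensitivity + ppv) / 2
        if best is None or average > best["average"]:
            best = {"cutoff": cutoff, "sensitivity": sensitivity,
                    "PPV": ppv, "average": average}
    return best
```

Other balance criteria (for example an F-score) would be equally easy to plug in; the average is used only because it matches the summary statistic quoted in the text.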
More importantly, a key drawback of the BLAST-based strategy is that it can only tell whether the query protein has a very significant hit in CAZyDB and then transfer the CAZyme family assignment from the hit to the query protein. Suppose the query protein has a GH and a CBM domain in reality, while the hit has only a GH domain: the BLAST annotation will assign the query to the GH family but miss the CBM family assignment. We can imagine even more complex situations with multiple such domains. In contrast, dbCAN annotation provides much richer information, such as which and how many CAZyme domains (including, e.g. repetitive CBM domains) a query protein has and where the boundaries of these domains are in the full-length protein. Therefore, overall dbCAN offers much better and more comprehensive CAZyme annotation than the simple BLAST search.
For the 248 CAZyme families having a selected CDD domain model, we checked if the CDD models can lead to accurate CAZyme annotation. We found that the CDD-based search was able to identify 94.3% (C. thermocellum) and 87.1% (A. thaliana) CAZyme homologs that are identified by our HMMs. However, one major issue with the CDD-based search is that CDD models are not specifically built for CAZyme families. There are cases of multiple CAZyme families mapped to the same CDD model, e.g. GT2, GT12 and GT45 families all pointing to pfam00535; hence one cannot tell which specific CAZyme family a query protein belongs to if it matches the CDD model pfam00535. Furthermore, 60 CAZyme families do not have a CDD model so the CDD-based CAZyme annotation is incomplete.
Overall our CAZyme family-specific HMMs-based method provides a significantly better solution to the automated CAZyme annotation problem than these simpler strategies.
DESCRIPTION OF THE dbCAN ANNOTATION SERVER
dbCAN provides a capability for automated CAZyme annotation for any given genome or set of protein sequences. Like most public protein databases such as Pfam and CDD, we make all the HMMs available through our website. Users can download the HMMs and run hmmscan on their proteins or genomes of interest against these domain models. We have also built a web server (Figure 2A) so that users can upload their protein or genome sequences for CAZyme annotation. A submitted job is processed on a Linux cluster with 100 computing nodes. For small bacterial genomes such as C. thermocellum, it normally takes <10 min to finish the annotation. A result page (Figure 2B) is returned showing the locations of the identified CAZyme domains and a diagram of the domain architecture, which is very useful for viewing multi-domain proteins.
Figure 2. Snapshots of dbCAN annotation server. (A) The query page, where users can paste some FASTA format protein sequences in the text box or upload a text file containing the FASTA sequences. Clicking on ‘submit’ will invoke the hmmscan program in the backend server to search the queried sequences against the dbCAN HMMs. (B) The result page, where users can download the raw output from the hmmscan run and view the processed tabular format output (if alignment length >80 amino acids, use E-value < 1e-5, otherwise use E-value < 1e-3). A diagram is shown in the bottom to illustrate the CAZyme domain architecture according to the positional information in the tabular output.
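Users who run hmmscan locally (for example with `--domtblout` to obtain a per-domain table) can apply the same length-dependent E-value rule to the output. The sketch below is one possible way to do this and not the server's own code. It assumes the standard HMMER3 domain table column order, uses the independent domain E-value, and measures alignment length on the protein; the text does not specify these choices, so they are assumptions, as are the function names.

```python
def parse_domtblout_line(line):
    """Parse one data row of an HMMER3 --domtblout table (hmmscan output)."""
    f = line.split(None, 22)  # the last column is a free-text description
    if len(f) < 22:
        raise ValueError("not a domtblout data row")
    return {
        "hmm": f[0],              # target name: the family HMM
        "hmm_len": int(f[2]),     # target length: HMM length
        "protein": f[3],          # query name: the protein
        "i_evalue": float(f[12]), # independent domain E-value (assumed choice)
        "hmm_from": int(f[15]), "hmm_to": int(f[16]),
        "ali_from": int(f[17]), "ali_to": int(f[18]),
    }


def passes_server_rule(hit):
    """E-value < 1e-5 if the alignment is longer than 80 aa, else < 1e-3."""
    ali_len = hit["ali_to"] - hit["ali_from"] + 1
    return hit["i_evalue"] < (1e-5 if ali_len > 80 else 1e-3)


def hmm_coverage(hit):
    """Fraction of the HMM covered by the alignment (assumed definition)."""
    return (hit["hmm_to"] - hit["hmm_from"] + 1) / hit["hmm_len"]


def filter_hits(lines):
    """Keep domain hits that satisfy the length-dependent E-value rule."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and table comments
        hit = parse_domtblout_line(line)
        if passes_server_rule(hit):
            kept.append(hit)
    return kept
```

Sorting the kept hits of one protein by `ali_from` gives the left-to-right domain order needed to draw an architecture diagram, and `hmm_coverage` can be compared with a coverage threshold such as the 0.5 or 0.3 values used in the evaluation.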
In addition, dbCAN provides pre-computed sequence alignments, HMMs and phylogenies of the signature domains of every CAZyme family, all downloadable from the dbCAN website. It also provides the following capabilities: CAZyme family-based browsing, genome-based browsing, keyword search, BLAST search and detailed functional annotation for every sequence included in dbCAN.
APPLICATION TO METAGENOME DATA SETS
Metagenomes, mixtures of genomic DNA from uncultured environmental microorganisms (16), represent a new source of enormously large gene pools containing potentially many new catalytic enzymes that could be of use for biotechnology (17–19). We have applied the 308 HMMs to search against a number of metagenomes such as the JGI metagenomes (20), the CAMERA marine metagenomes (21–23) and two recently published animal gut metagenomes (5,24). Using E-value < 1e-5 as the cutoff, we obtained over one million (1 038 912) full-length CAZyme homologous proteins containing 1 209 177 CAZyme domain regions, all of which are accessible from the dbCAN website. This is about three times the number of CAZyme homologs (358 959) in the NCBI-nr database, indicating that there are many new CAZyme-related proteins in the environmental metagenomes awaiting further investigation (manuscript in preparation); many of them may represent new catalytic enzymes that could be of good use for the biotech industry (17,25).
DISCUSSION
dbCAN is designed to offer a free, easy-to-use and public service of automated CAZyme annotation to users worldwide. Such a service will be highly useful to researchers who sequenced biotech-related genomes and metagenomes and will be very valuable in helping to find novel catalysts, e.g. (2–4,6–10). A key unique feature of the dbCAN database is its collection of the CAZyme family-specific HMMs, which are built based on the annotated CAZyme proteins by CAZyDB. The following is worth noting about using dbCAN.
dbCAN models are different from the selected CDD models, for we just used CDDs to locate the CAZyme signature domain regions and build our own models based on the domain regions in the annotated CAZyme proteins. Therefore, dbCAN models are CAZyme-specific and each CAZyme family has a unique HMM. In addition, 60 dbCAN models are new and have not been described in CDD.
dbCAN is built upon CAZyDB but is not meant to be a substitute for CAZyDB. dbCAN aims to enable automated CAZyme annotation at a genome scale, while CAZyDB is the original database that has created all the CAZyme families since the early 1990s and will continue to create new families. The creation of new families is often done in coordination between experimentalists and the CAZyDB team. We will add new dbCAN HMMs as soon as CAZyDB adds new CAZyme families, to provide a service complementary to that of CAZyDB.
CAZyDB may have domain models internally for many if not all CAZyme families, but does not release them to the public. This might be because these models are constantly updated or are not considered suitable for automated annotation. In fact, CAZyDB performs semi-automatic annotation for newly sequenced genomes. However, the entire annotation process is therefore invisible to users, who receive no guidance on how to perform automated CAZy annotation themselves when they wish to do so.
dbCAN annotation explicitly offers the positions of each CAZyme domain in each full-length protein, which are missing for all annotated proteins in CAZyDB. However, it should be noted that the exact domain boundaries in each protein annotated by dbCAN might be slightly different from those in CAZyDB.
CONCLUSION
In summary, through dbCAN, we have made two key contributions: (i) we recovered and defined a signature domain model for each and every CAZyme family and (ii) we released all models freely to the community and built a web server to facilitate efficient annotation of CAZyme proteins at a genome scale. With dbCAN models and the web server, users can easily obtain a comprehensive and automated CAZyme annotation, on which they can perform their own manual curation if they choose to do so.
FUNDING
U.S. Department of Energy [DE-PS02-06ER64304]; National Science Foundation [DEB-0830024]; Office of Biological and Environmental Research in the DOE Office of Science [to The BioEnergy Science Center]. Funding for open access charge: The BioEnergy Science Center.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
We acknowledge the Research Computing Center of the University of Georgia for providing computing facilities. We thank Wen-Chi Chou for helping with the curation of CAZyme signature domains in the early stage of this project.
REFERENCES
Author notes
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Comments
Thank you for pointing this out. We understand and respect that the high quality of CAZyDB annotation stands on its manual curation, which avoids the problem that automated annotation produced in this example. But from a bioinformatics perspective, such a very low E-value does mean that it is a very significant match to the CE10 HMM model, which is built based on the annotated CE10 protein sequences in CAZyDB (though the family has not been updated since 2002). Again, this is just a prediction.
Conflict of Interest:
None declared
In support of their methods, the authors (right column, p.W447) report that many proteins of CAZy family CE10 have apparently been missed by the CAZy database. The reason for this apparent incompleteness can be found by simply reading the header of family CE10 in the CAZy database (http://www.cazy.org/CE10.html): CAZy stopped updating this family in August 2002 for lack of evidence that it actually contained bona fide carbohydrate esterases. Consequently, the many CE10 homologues that have appeared since August 2002 have not been "missed" by the CAZy database; they have just been left out.
Conflict of Interest:
None declared
We found that reference (9) was improperly cited in the fourth paragraph of the Introduction and we wish to remove it from the following sentence:
"For a comprehensive and accurate annotation of the CAZyme families, users often have to contact the developers of CAZyDB for their semi- automatic annotations (1,8-10)."
Conflict of Interest:
None declared