ChemProt: a disease chemical biology database
Abstract
Systems pharmacology is an emergent area that studies drug action across multiple scales of complexity, from molecular and cellular to tissue and organism levels. There is a critical need to develop network-based approaches to integrate the growing body of chemical biology knowledge with network biology. Here, we report ChemProt, a disease chemical biology database, which is based on a compilation of multiple chemical–protein annotation resources, as well as disease-associated protein–protein interactions (PPIs). We assembled more than 700 000 unique chemicals with biological annotation for 30 578 proteins. We gathered over 2 million chemical–protein interactions, which were integrated in a quality-scored human PPI network of 428 429 interactions. The PPI network layer allows for studying disease and tissue specificity through each protein complex. ChemProt can assist in the in silico evaluation of environmental chemicals, natural products and approved drugs, as well as the selection of new compounds based on their activity profile against most known biological targets, including those related to adverse drug events. Results from the disease chemical biology database associate citalopram, an antidepressant, with osteogenesis imperfecta and leukemia, and bisphenol A, an endocrine disruptor, with certain types of cancer. The server can be accessed at http://www.cbs.dtu.dk/services/ChemProt/.
INTRODUCTION
The old drug design paradigm, in which drugs interact selectively with one or two targets (proteins) to treat or prevent disease, is now challenged by several studies showing that most drugs interact with multiple targets (‘polypharmacology’) (1,2). For example, celecoxib, often considered a selective cyclooxygenase-2 non-steroidal anti-inflammatory drug (NSAID), has been documented to be active on at least two additional targets, namely carbonic anhydrase II and 5-lipoxygenase (3). Rosiglitazone, which has been used for the treatment of type II diabetes mellitus, not only stimulates the peroxisome proliferator-activated receptor γ, but also blocks interferon gamma-induced chemokine expression in Graves' disease or ophthalmopathy (4). Polypharmacology is not always beneficial, as it often causes side effects: cisapride, which acts as a serotonergic 5-HT4 receptor agonist, and astemizole, which blocks histamine H1 receptors (H1Rs), have both been withdrawn from all markets due to the risk of fatal cardiac arrhythmia associated with their blockade of the hERG potassium ion channel, an unanticipated and undesirable ‘anti-target’ associated with QT prolongation and ‘torsades de pointes’ (5). However, ‘target’ and ‘anti-target’ are dynamic attributes, as exemplified by the case of H1R antagonists and their (in)ability to achieve clinically significant levels in the brain, influenced by the ATP-binding cassette transporter ABCB1 (also known as P-glycoprotein), which effluxes some of these drugs from the brain (6). The need to know the complete pharmacology profile has inspired new strategies to predict and to characterize drug-target associations in order to improve the success rates of current drug discovery paradigms, i.e. to increase efficacy and reduce toxicity and adverse effects (2).
As large-scale chemical bioactivity databases are being assembled, the polypharmacology (i.e. high affinity bioactivity across related targets) and promiscuity (i.e. low affinity across multiple families) of chemicals are expanding the chemical space for druggable targets (7). These studies are often focused on specific protein families, such as G-protein coupled receptors (8), nuclear receptors (9) and kinases (10), but global pharmacology profiles of chemicals are considered as well (1,2). Recent chemoinformatics advances support the development of polypharmacology data mining, e.g. via iPHACE, an integrative web-based tool that enables pharmacological space navigation for small molecule drugs (11), or via the Similarity Ensemble Approach (SEA), which relates protein pharmacology by ligand chemistry (12). Biological information can also be retrieved for a large set of chemical compounds through PubChem (13), ChEBI and ChEMBL (14).
Two conceptual developments support polypharmacology: systems pharmacology, aimed at drug actions in the context of regulatory networks (15); and systems chemical biology (16), which introduces chemical awareness in systems biology. Since proteins rarely operate in isolation inside and outside cells, but rather function in highly interconnected cellular pathways, interactome networks have been developed by data integration. Yildirim et al. (17) combined FDA-approved drugs with a human protein–protein interaction (PPI) network (human interactome) in order to analyze the interrelationships between drug targets and disease–gene products i.e. disease–proteins. Similar work has been based on PubChem bioassays as source of polypharmacology (18). The use of side-effect similarity has been proposed on the assumption that drugs with similar side-effects are likely to interact with similar target proteins (19). Recent advances include a protein–protein association network based on the chemical toxicology of environmental chemicals (20) and a human disease network linking disorders and disease genes to various known phenotypes (21).
Our goal in the present work was to develop a disease chemical biology server, called ChemProt, based on the integration of chemical–protein annotation resources that are now accessible from large repositories, and curated disease-linked PPI data (22). ChemProt is designed to assist the elucidation of drug actions in the context of cellular and disease networks. In addition, it allows the identification of additional genes that may play major roles in modulating the response to chemicals, i.e. drugs, environmental chemicals and natural products, thus leading to new options in drug discovery and environmental chemical evaluation. Lastly, the ChemProt server could contribute to drug repurposing as well as to the investigation of chemicals related to anti-targets and adverse drug events.
IMPLEMENTATION
Data sources
We first gathered chemical–protein interaction data from different open-source databases, i.e. ChEMBL (version chembl_05) (14), BindingDB (23), PDSP Ki Database (24), DrugBank (version 2.5) (25) and PharmGKB (26), and from two commercial databases, WOMBAT (version 2009) and WOMBAT-PK (version 2008) (7). Active compounds from the PubChem bioassay (2010) have been collected as well (13). We considered only active compounds from ‘confirmatory’ assays in order to capture high-confidence chemical–protein annotations from PubChem. These databases provide experimental evidence of chemical–protein interactions. Drug-target information was collected from DrugBank and PharmGKB. In addition, we integrated chemical–protein associations from CTD (version 2009) (27) and STITCH (version STITCH 2.0) (28). These last two databases consider the effect or modulation (positive or negative) of a chemical on proteins, other than that defined as binding activity. Examples include gene expression or pathway data, where the deregulation of a gene by a chemical may not be due to a physical interaction between the two entities but to a response at the cellular level. Duplicate chemicals from the multiple databases were identified by their InChI keys and merged into a single ChemProt ID. However, the biological information associated with each chemical was kept separately per source, for users interested in specific databases. Overall, the final database contains 700 000 distinct molecules annotated for 30 578 proteins.
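The merging step described above can be illustrated with a minimal Python sketch. It is not the ChemProt implementation: the record layout, the function name `merge_by_inchikey`, the `CP000001`-style identifier format and the placeholder keys `KEY-A` and `KEY-B` are all assumptions made for illustration. The sketch only shows the idea of collapsing duplicates onto one identifier per InChI key while keeping the annotations grouped by source database.

```python
from collections import defaultdict


def merge_by_inchikey(records):
    """Collapse (source, inchikey, protein) records into one entry per InChI key.

    Each entry receives a single identifier (hypothetical format) and keeps
    its protein annotations grouped by the database they came from.
    """
    merged = {}
    for source, inchikey, protein in records:
        if inchikey not in merged:
            merged[inchikey] = {
                # Hypothetical ID scheme, numbered in order of first appearance.
                "chemprot_id": f"CP{len(merged) + 1:06d}",
                "annotations": defaultdict(set),
            }
        merged[inchikey]["annotations"][source].add(protein)
    return merged


# Toy records: the keys are placeholders, not real InChI keys.
records = [
    ("ChEMBL", "KEY-A", "DRD4"),
    ("PDSP Ki", "KEY-A", "DRD3"),
    ("PDSP Ki", "KEY-A", "ADRA2A"),
    ("CTD", "KEY-B", "ESR1"),
]

merged = merge_by_inchikey(records)
for key, entry in merged.items():
    per_source = {src: sorted(p) for src, p in entry["annotations"].items()}
    print(key, entry["chemprot_id"], per_source)
```

Keeping a per-source mapping rather than one flat protein list is what would let a user restrict a query to a selected database after the merge.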
Descriptors and similarity measurement
The chemical structure of the molecules was encoded using two rather different types of fingerprints. The 166 MACCS keys encode the presence or absence of predefined substructures or functional groups (29). In contrast, the more complex 3-point pharmacophore fingerprint (GpiDAPH3) is based on an expansion of the PATTY pharmacophore feature recognition scheme for a 2D structure (30). This scheme assigns one or more pharmacophore feature types to all atoms in a molecule using a predefined list of SMARTS queries. The list of pharmacophore feature types comprises: hydrogen-bond donor (D), hydrogen-bond acceptor (A), polar (P) and hydrophobic (H). In addition, an extra label (p or pi) is added to each feature if the originating atom or group is sp2-hybridized or planar for other reasons. The GpiDAPH3 pharmacophore feature scheme is expressed in 2D as triplet feature combinations with a graph-based inter-atom distance binning scheme. Both fingerprints are implemented in the Molecular Operating Environment (MOE, version 2008.10) (31). The similarity between two molecules is measured using the Tanimoto coefficient (Tc), a method of choice for the computation of fingerprint-based similarity (32). The Tc is defined as the number of bits set in both molecules divided by the number of bits set in at least one of them. For any pair of chemicals, Tc takes values between 0 and 1, and a high Tc represents high similarity.
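As an illustration of the Tanimoto coefficient, the following Python sketch computes Tc for two fingerprints represented as sets of "on" bit positions. The set representation, the function name `tanimoto` and the example bit positions are assumptions for illustration only; the fingerprints used in ChemProt are generated in MOE and are not reproduced here. Returning 0 for two empty fingerprints is a convention chosen for this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits.

    Tc = |A and B| / |A or B|: shared on-bits divided by on-bits set in
    at least one of the two molecules.
    """
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


# Toy fingerprints: 2 shared bits, 5 bits set overall.
query = {1, 5, 9, 42}
candidate = {5, 9, 77}
print(tanimoto(query, candidate))  # 2 / 5 = 0.4
```

A similarity search then amounts to keeping the database molecules whose Tc to the query meets a chosen threshold.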
PPI network
The human interactome used is an in-house protein–protein interaction network inferred from experiments in both humans and model organisms (22). Using an elaborate scoring scheme, all interactions have been validated against a gold standard (33). The current interactome contains 428 429 unique protein–protein interactions derived from source databases such as BIND (34), GRID (35), MINT (36), dip_full (37), HPRD (38), intact (39), mppi (40), MPact (41), Reactome (42) and KEGG (43). Data are transferred between organisms by using the Inparanoid orthology database (44). In total, the human interactome comprises 22 997 genes.
Human disease genes and complexes
Based on a previous study (45), disease-associated protein complexes were linked to the chemical–protein annotation. By mining OMIM (46) and GeneCards (47), two data resources for gene–disease associations, we collected a list of 2227 unique disease-related proteins and mapped the complexes of genes to diseases. Similarly, complexes of genes were mapped to Gene Ontology (GO) terms (48) and to tissues by using the expression data from 73 non-disease tissues from the Novartis Research Foundation Gene Expression Database (GNF) (49) and the Human Protein Atlas (50). Users of ChemProt can thus retrieve gene complexes that are related to a query chemical and visualize the annotations of each complex.
APPLICATIONS
Chemical–protein interactions
Chemicals can be searched by common name, by SMILES or by drawing the 2D structure, or retrieved through their annotation to a protein. Users can then choose the descriptor space and the Tc threshold to be used for the similarity search. Following a successful query, hits grouped by species are returned, together with computed physico-chemical properties such as molecular weight, LogP, the number of hydrogen bond donors and acceptors, the number of rigid bonds and the number of rings, based on the Marvin applet from ChemAxon (51). Hits are provided separately for known annotations and for predicted small molecule bioactivity. The biochemical and pharmacological effects of a chemical, e.g. substrate, inhibitor, agonist or antagonist, are provided if such information is available, together with hyperlinks to UniProt and Ensembl, which lead to more information on protein sequence and function, respectively.
From chemical–protein interactions to complex protein–disease associations
The unique feature of ChemProt is that it offers the user the possibility to get information at a cellular level, by linking chemically-induced biological perturbations to specific tissues and phenotypes.
Proteins that are both affected by a chemical and participate in one or more protein complexes are highlighted in the results table of the ChemProt server. By clicking on the protein, the user is redirected to the ‘Disease complexes’ server and chooses which complex to visualize. On the ‘Disease complexes’ server, the size and an illustration of the protein network are provided. Additionally, enrichment analysis results of the proteins in the complex are shown, with respect to disease association (OMIM, BioAlma), GO terms (biological process, cellular component) and tissue specificity (Human Protein Atlas, GNF). To ensure that the complexes were biologically relevant entities, the enrichment of the biological terms (OMIM, GO,…) was compared to randomly generated complexes (1.0e6). Significance was calculated using a hypergeometric test, and the P-value for the most significantly enriched term for each of the data types was calculated as previously described (45). The table presenting the OMIM enrichment results is interactively linked with an illustration of the protein complex, in which proteins associated with the selected disease are colored yellow.
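A hypergeometric enrichment test of the kind mentioned above can be sketched in a few lines of Python. This is a generic one-sided test written under stated assumptions, not the exact procedure of reference (45): the function name `hypergeom_enrichment_p`, its parameter names and the toy numbers are illustrative, and the comparison against randomly generated complexes is not reproduced. The sketch uses only the standard library (`math.comb`, available from Python 3.8).

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided hypergeometric P-value, P(X >= hits).

    population: number of proteins in the background (e.g. the interactome)
    annotated:  background proteins carrying the term (e.g. a disease)
    sample:     number of proteins in the complex
    hits:       proteins in the complex carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for `hits` or more annotated proteins
    # when drawing `sample` proteins without replacement.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Toy example: background of 10 proteins, 3 annotated, complex of 2, 1 hit.
print(hypergeom_enrichment_p(10, 3, 2, 1))  # 24/45, about 0.533
```

A small P-value indicates that the term occurs in the complex more often than expected by chance given its background frequency; with many terms tested per complex, a multiple-testing consideration would also apply.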
Output of the chemical–protein interactions and disease complexes can be downloaded from the ChemProt website. In addition, the ‘Reflect’ service provides further information on chemicals and genes (52). ‘Reflect’ tags gene, protein and small molecule names in text and offers the opportunity to quickly view additional information on the ChemProt results, including synonyms, protein sequences, domains, 3D structures and subcellular location.
EXAMPLES
With the integration of several databases, ChemProt not only provides pharmacological information, but also includes biological data associated with environmental chemicals and natural products. As seen in the examples below, ChemProt can be queried for drugs as well as environmental chemicals. A search for citalopram, an antidepressant, illustrates the complementarity of the integrated databases within ChemProt (Figure 1). Marketed as a selective serotonin reuptake inhibitor (SSRI) (DrugBank), this drug displays bioactivity on seven human proteins (ChEMBL). Via ChemProt, four other proteins (DRD3, 5HT1B, 5HT3, ADRA2A) are retrieved from the Ki database. Additional information on drug-target associations is provided by STITCH and CTD. For the first annotation, to the D4 dopamine receptor (DRD4), the disease term (under Disease Complexes) is highlighted, indicating that protein–protein interaction information for this protein is available. Using the link to the Disease Complexes server, one finds that DRD4 interacts with three proteins (SRC, GRB2 and NCK1). According to OMIM, this protein network is associated with osteogenesis imperfecta and leukemia and, according to BioAlma, with several psychotic disorders. GO enrichment indicates significant association of the protein complex with signal complex formation and vesicle membrane. Furthermore, tissue annotation suggests that this complex is mainly expressed in follicle and non-follicle cells (HPA) and dendritic cells (GNF). Although it might be surprising to see a connection between an antidepressant and leukemia, it has been shown recently that antidepressants such as clomipramine and fluoxetine reduce the growth of B-cell malignancies in leukemia (53).
Figure 1. Chemical–protein annotation and disease associations retrieved from ChemProt for the compound citalopram. (1) The compound can be queried using different formats (name, SMILES and structure). (2) A query results in a table showing protein annotations and bioactivity predictions for the compound. (3) Finally, a protein–protein interaction network (protein–complex) for a target protein can be depicted and disease associations (OMIM and BioAlma) and other biological components (GO terms, HPA and mRNA expression) are displayed.
The second query, ‘bisphenol A’ (BPA), is an environmental pollutant used as a plasticizer (54). BPA has biological activity on the estrogen receptor α (ESR1), the androgen receptor (AR) and the estrogen-related receptor gamma (ERR3). However, several other proteins are retrieved from CTD and STITCH based on association data with this chemical. Looking at ESR1 in the Disease Complexes server, a complex of 17 proteins is depicted (complex 265) with significant associations with Li-Fraumeni syndrome, breast cancer and neoplasms. Enrichment analysis indicates that the complex is found in the nucleus (GO cellular component), involved in the regulation of metabolic processes and transcriptionally regulated by the RNA polymerase II promoter (GO biological process). Furthermore, data from immunohistochemistry studies suggest that the complex is mainly located in the endometrium and the cerebral cortex (HPA). The disease chemical biology network for BPA indicates that, under certain conditions, this chemical may be associated with certain types of cancer.
We have illustrated that ChemProt integrates molecular, cellular and phenotypic data associated to small molecules, which can lead to novel links and suggest new avenues for research. We envisage that the ChemProt server will find applications within a variety of chemogenomics, polypharmacology and systems chemical biology studies. ChemProt will be updated once a year with new compounds, new interactions and more sophisticated descriptors.
FUNDING
EU (DEER); Innovative Medicines Initiative Joint Undertaking (eTOX); Danish Research Council for Technology and Production Sciences; Lundbeck Foundation and the Villum Rasmussen Foundation. Funding for open access charge: DEER.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
Sunset Molecular Discovery LLC (www.sunsetmolecular.com) contributed with the WOMBAT databases.