Genome Res. 2004 Jun;14(6):1085-94.
doi: 10.1101/gr.1910904.

Coexpression analysis of human genes across many microarray data sets

Homin K Lee et al. Genome Res. 2004 Jun.

Abstract

We present a large-scale analysis of mRNA coexpression based on 60 large human data sets containing a total of 3924 microarrays. We sought pairs of genes that were reliably coexpressed (based on the correlation of their expression profiles) in multiple data sets, establishing a high-confidence network of 8805 genes connected by 220,649 "coexpression links" that are observed in at least three data sets. Confirmed positive correlations between genes were much more common than confirmed negative correlations. We show that confirmation of coexpression in multiple data sets is correlated with functional relatedness, and show how cluster analysis of the network can reveal functionally coherent groups of genes. Our findings demonstrate how the large body of accumulated microarray data can be exploited to increase the reliability of inferences about gene function.

Figures

Figure 1
Schematic of the methodology. Only two data sets are shown here; our analysis made use of 60 data sets. The schematic outlines the analysis of a hypothetical “Gene X” in two data sets. First (top) in data set 1 we seek genes with expression profiles that are similar to that of Gene X, generating a set of raw “coexpression links.” Only links that are deemed statistically significant in the context of data set 1 are stored. Then, we repeat this analysis in data set 2 (bottom). We then seek coexpression links that are common between the two data sets. This procedure is then repeated for each gene, and in more data sets. It is important to note that the profiles themselves need not be similar between data sets, nor do the profiles need to be “relevant” to any sample groups in the data sets. The data sets can also be from different microarray platforms, tissues, or species (though we present only human comparisons here). See Methods for details.
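
The procedure in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy data, and the fixed correlation cutoff are all assumptions. In particular, the cutoff stands in for the per-data-set significance test described in the paper's Methods, which this page does not reproduce.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # flat profile: treat as uncorrelated
    return sxy / sqrt(sxx * syy)


def raw_links(dataset, threshold):
    """Return {(gene_a, gene_b): sign} for pairs with |r| >= threshold.

    ASSUMPTION: a fixed cutoff replaces the paper's significance test.
    """
    links = {}
    for a, b in combinations(sorted(dataset), 2):
        r = pearson(dataset[a], dataset[b])
        if abs(r) >= threshold:
            links[(a, b)] = 1 if r > 0 else -1
    return links


def confirmed_links(datasets, threshold=0.9, min_support=3):
    """Keep links seen with the same sign in at least min_support data sets."""
    support = Counter()
    for dataset in datasets:
        for pair, sign in raw_links(dataset, threshold).items():
            support[(pair, sign)] += 1
    return {key: n for key, n in support.items() if n >= min_support}


# Toy data (hypothetical): profiles differ in shape and length between
# data sets; only the gene-gene relationship has to recur.
ds1 = {"X": [1, 2, 3, 4], "Y": [2, 4, 6, 8], "Z": [4, 3, 2, 1]}
ds2 = {"X": [5, 1, 3], "Y": [10, 2, 6], "Z": [1, 2, 1]}

print(confirmed_links([ds1, ds2], min_support=2))
# {(('X', 'Y'), 1): 2}
```

Links are computed within each data set separately and only the pair identity and sign are compared afterwards, which is why the data sets need not share samples, platforms, or profile shapes.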
Figure 2
General properties of coexpression confirmation in the database. (A) Distribution of links at different levels of confirmation. The vertical dashed line marks the total number of data sets analyzed (60). Most links are not confirmed, but some links are confirmed in up to 31 data sets. (B) The number of “raw links” (those that are confirmed or not) plotted against the number of links that are confirmed in at least three data sets. Each point represents one gene. Genes with many raw links tend to have more confirmed links. (C) Degree distribution of links confirmed in at least three data sets.
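
As a rough illustration of the quantities plotted in panels A and C, the sketch below tallies links by confirmation level and computes a degree distribution for links at or above a chosen level. The input format (a dictionary mapping a gene pair to the number of data sets in which the link was observed) and both function names are assumptions made for this example.

```python
from collections import Counter


def confirmation_histogram(support):
    """Map confirmation level -> number of links observed at that level."""
    return Counter(support.values())


def degree_distribution(support, min_support=3):
    """Map degree -> number of genes with that many confirmed links."""
    degree = Counter()
    for (gene_a, gene_b), n in support.items():
        if n >= min_support:
            degree[gene_a] += 1
            degree[gene_b] += 1
    return Counter(degree.values())


# Hypothetical support counts for four links among four genes.
support = {("A", "B"): 5, ("A", "C"): 3, ("B", "C"): 1, ("C", "D"): 4}
print(confirmation_histogram(support))
print(degree_distribution(support, min_support=3))
```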
Figure 3
Relationship between link confirmation and the semantic similarity of the selected genes. The x-axis indicates GO term overlap (see Methods). The cumulative distributions of semantic similarity scores for sets of links selected by different criteria are plotted. The dashed line indicates the distribution for randomly selected pairs of genes. Each solid curve is the cumulative probability distribution measured for pairs of genes identified by coexpression links at varying levels of confirmation (including both positive and negative correlations). The black curve is the distribution for coexpression pairs that are not confirmed. Confirmed links tend to have higher levels of GO term overlap. The x-axis is truncated at 30 (only 694 “2+” pairs, that is, links confirmed in at least two data sets, have more than 30 terms in common; the maximum is 95, for one pair).
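
A small sketch of how such a curve could be computed, assuming GO term overlap is simply the number of terms two genes have in common (the paper's Methods give the actual definition). The annotation table, term labels, and function names are hypothetical.

```python
def go_overlap(annotations, gene_a, gene_b):
    """Number of GO terms shared by two genes (0 if either is unannotated)."""
    terms_a = annotations.get(gene_a, set())
    terms_b = annotations.get(gene_b, set())
    return len(terms_a & terms_b)


def cumulative_distribution(scores, max_score):
    """Fraction of scores <= k, for k = 0 .. max_score."""
    n = len(scores)
    return [sum(s <= k for s in scores) / n for k in range(max_score + 1)]


# Hypothetical annotations; the term labels are placeholders, not real GO IDs.
annotations = {
    "A": {"term1", "term2", "term3"},
    "B": {"term2", "term3"},
    "C": {"term9"},
}
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
scores = [go_overlap(annotations, a, b) for a, b in pairs]
print(cumulative_distribution(scores, max_score=2))
```

Computing one such curve per confirmation level, plus one for randomly drawn gene pairs, gives the kind of comparison the caption describes.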
Figure 4
Hierarchical clustering of the coexpression network at a high level of confirmation. The left-hand side of the figure is the (diagonally symmetric) interaction matrix for 506 genes. Each color-coded entry is an interaction that is seen in seven or more data sets. The colored boxes indicate the main clusters, which are labeled according to their functional theme. A light blue box indicates a large diffuse cluster that dominates the upper half of the figure. A second box (orange) indicates several immune system-related clusters that are placed near each other. Blue lines connect many of the smaller clusters to the right-hand side of the figure, which depicts GO annotations for the same genes. On the right-hand side, each column represents a different GO term. The columns (495 GO terms) were arranged by hierarchical clustering, placing terms with similar annotation patterns together. The entries of the matrix are colored according to the status of the cluster-GO term association for the gene and term (see Methods). Green indicates term-cluster associations that were significant. Dark gray indicates the best GO term-gene cluster associations that did not meet all criteria. Light gray points indicate GO term-gene combinations that were not associated with a high-scoring cluster. These groups were used to define the cluster labels in the left half of the figure.
Figure 5
Examples of clusters extracted from the 3+ network with MCODE. See text for details. Increasing thickness of lines denotes increasing numbers of data sets in which the link was observed.

References

    Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503–511.
    Allander, S.V., Nupponen, N.N., Ringner, M., Hostetter, G., Maher, G.W., Goldberger, N., Chen, Y., Carpten, J., Elkahloun, A.G., and Meltzer, P.S. 2001. Gastrointestinal stromal tumors with KIT mutations exhibit a remarkably homogeneous gene expression profile. Cancer Res. 61: 8624–8628.
    Armstrong, S.A., Staunton, J.E., Silverman, L.B., Pieters, R., den Boer, M.L., Minden, M.D., Sallan, S.E., Lander, E.S., Golub, T.R., and Korsmeyer, S.J. 2002. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet. 30: 41–47.
    Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25–29.
    Bader, G.D. and Hogue, C.W. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4: 2.

WEB SITE REFERENCES

    http://microarray.cpmc.columbia.edu/tmm; Database and additional resources for analysis of coexpression across data sets.
    http://genetics.stanford.edu/~sherlock/cluster.html; Clustering software.
