Comparative Study
BMC Bioinformatics. 2009 Jan 26:10:34.
doi: 10.1186/1471-2105-10-34.

Sparse canonical methods for biological data integration: application to a cross-platform study

Kim-Anh Lê Cao et al. BMC Bioinformatics. 2009.

Abstract

Background: In the context of systems biology, few sparse approaches have been proposed so far to integrate several data sets. Yet data integration is an important and fundamental issue that will be widely encountered in post-genomic studies, when transcriptomics, proteomics and metabolomics data from different platforms are analyzed simultaneously to understand the mutual interactions between the data sets. In this high-dimensional setting, variable selection is crucial to obtain interpretable results. We focus on a sparse Partial Least Squares approach (sPLS) to handle two-block data sets in which the relationship between the two types of variables is known to be symmetric. Sparse PLS has been developed for both a regression and a canonical correlation framework and includes a built-in procedure to select variables while integrating data. To illustrate the canonical mode approach, we analyzed the NCI60 data sets, where two different platforms (cDNA and Affymetrix chips) were used to study the transcriptome of sixty cancer cell lines.

Results: We compare the sPLS results with those of two other sparse or related canonical correlation approaches: CCA with Elastic Net penalization (CCA-EN) and Co-Inertia Analysis (CIA). The latter does not include a built-in procedure for variable selection and requires a two-step analysis. We stress the lack of statistical criteria to evaluate canonical correlation methods, which makes biological interpretation necessary to compare the different gene selections. We also propose comprehensive graphical representations of both samples and variables to facilitate the interpretation of the results.

Conclusion: sPLS and CCA-EN selected highly relevant genes and complementary findings from the two data sets, which enabled a detailed understanding of the molecular characteristics of several groups of cell lines. The two approaches gave similar results, although they highlighted the same phenomena with different priorities. Both outperformed CIA, which tended to select redundant information.
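The built-in variable selection mentioned in the Background can be illustrated with a minimal sketch. It assumes that sparsity is obtained by soft-thresholding the two loading vectors of the cross-product matrix between the blocks, which is one common way to formulate sparse PLS; the abstract itself does not spell out the penalty. The function names, the penalty parameters `lam_x` and `lam_y`, and the restriction to a single dimension (no deflation step) are illustrative choices, not the authors' implementation.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of `a` toward zero by `lam`; small entries become exactly 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def spls_first_dimension(X, Y, lam_x, lam_y, n_iter=100, tol=1e-8):
    """First pair of sparse loading vectors and latent variables (illustrative sketch).

    X is (n, p) and Y is (n, q), with the same n samples in the same order.
    lam_x and lam_y are hypothetical penalty parameters: larger values select fewer variables.
    """
    X = X - X.mean(axis=0)            # center each variable
    Y = Y - Y.mean(axis=0)
    M = X.T @ Y                       # (p, q) cross-product matrix between the two blocks

    # Start from the leading singular vectors of M (the non-sparse solution).
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        u_new = soft_threshold(M @ v, lam_x)
        if not u_new.any():
            raise ValueError("lam_x is too large: no X variable was selected")
        u_new /= np.linalg.norm(u_new)

        v_new = soft_threshold(M.T @ u_new, lam_y)
        if not v_new.any():
            raise ValueError("lam_y is too large: no Y variable was selected")
        v_new /= np.linalg.norm(v_new)

        converged = (np.linalg.norm(u_new - u) < tol
                     and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break

    xi = X @ u                        # latent variable for the X block
    omega = Y @ v                     # latent variable for the Y block
    return u, v, xi, omega
```

The nonzero entries of `u` and `v` are the variables selected in each block. The symmetric treatment of `X` and `Y` in the loop reflects the canonical setting described above; further dimensions would require deflating both blocks, which this sketch omits.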

Figures

Figure 1
Rd. Cumulative explained variance (Rd criterion) of each data set in relation to its component score (CCA-EN, CIA) or latent variable (sPLS).
Figure 2
Hierarchical clustering of the two data sets using all expression profiles. Hierarchical clustering of the cell lines with Ward method and correlation distance using the expression profiles from the Ross (left) and Staunton (right) data sets. The tissues of origin of the cell lines are coded as BR = Breast, CNS = Central Nervous System, CO = Colon, LE = Leukaemia, ME = Melanoma, LU = Lung, OV = Ovarian, PR = Prostate, RE = Renal. The Ward method maximizes the between-cluster inertia and minimizes the within-cluster inertia for each step of the clustering algorithm. Height represents the loss of between-cluster inertia for each clustering step. Dashed lines cut the dendrograms to highlight the three main clusters.
Figure 3
Graphical representations of the samples using CCA-EN. Graphical representations of the cell lines obtained by plotting the CCA-EN component scores for dimensions 1 and 2 (a) or 1 and 3 (b). The component scores computed on each data set are superimposed: the start of each arrow shows the location of the Ross sample, and the tip the location of the Staunton sample. A short arrow indicates that both data sets strongly agree. The colors indicate the tissues of origin of the cell lines, with BR = Breast, CNS = Central Nervous System, CO = Colon, LE = Leukaemia, ME = Melanoma, LU = Lung, OV = Ovarian, PR = Prostate, RE = Renal.
Figure 4
Graphical representations of the samples using sPLS. Graphical representations of the cell lines obtained by plotting the sPLS latent variable vectors for dimensions 1 and 2 (a) or 1 and 3 (b). The latent variable vectors computed on each data set are superimposed: the start of each arrow shows the location of the Ross sample, and the tip the location of the Staunton sample. A short arrow indicates that both data sets strongly agree. The colors indicate the tissues of origin of the cell lines, with BR = Breast, CNS = Central Nervous System, CO = Colon, LE = Leukaemia, ME = Melanoma, NS = Lung, OV = Ovarian, PR = Prostate, RE = Renal.
Figure 5
Graphical representations of the variables selected by sPLS, Ross data set. Example of graphical representation of the genes selected on the first two sPLS dimensions. The coordinates of each gene are obtained by computing the correlation between the latent variable vectors (ξ1, ξ2) and the original Ross data set. The selected cDNAs are then projected onto correlation circles, where highly correlated cDNAs cluster together. These graphics help identify correlated genes between the two platforms (by superimposing the graphics from Figures 5 and 6). Combined with the information in Figure 4, they also make it possible to associate the gene clusters with a type of tumor cell line. The labels of the cDNAs can be plotted interactively in R to facilitate their identification. Subsets of the selected genes may also be displayed alone to focus on specific, user-defined gene groups.
Figure 6
Graphical representations of the variables selected by sPLS, Staunton data set. Example of graphical representation of the genes selected on the first two sPLS dimensions. The coordinates of each gene are obtained by computing the correlation between the latent variable vectors (ω1, ω2) and the original Staunton data set. The selected Affymetrix probes are then projected onto correlation circles, where highly correlated probes cluster together. These graphics help identify correlated genes between the two platforms (by superimposing the graphics from Figures 5 and 6). Combined with the information in Figure 4, they also make it possible to associate the gene clusters with a type of tumor cell line. The labels of the Affymetrix probes can be plotted interactively in R to facilitate their identification. Subsets of the selected genes may also be displayed alone to focus on specific, user-defined gene groups.
Figure 7
Venn Diagrams. Venn diagrams for the 100 selected genes associated with melanoma vs. the other cell lines for each data set (top). These lists were then decomposed into up- and down-regulated genes (bottom).
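The captions of Figures 5 and 6 describe each variable's coordinates as its correlation with the first two latent variable vectors. A minimal sketch of that computation follows, assuming a samples-by-variables data matrix and an n × 2 array holding the two latent vectors. The function name `correlation_circle_coords` and the argument layout are hypothetical; the captions refer to plotting in R, so this is only an illustration of the arithmetic.

```python
import numpy as np


def correlation_circle_coords(X, latent):
    """Correlation of every column of X with each latent vector.

    X is (n, p): n samples by p selected variables (no constant columns).
    latent is (n, 2): the two latent variable vectors for the same samples.
    Returns a (p, 2) array; row j gives the plotting coordinates of variable j.
    """
    Xc = X - X.mean(axis=0)                 # center the variables
    Lc = latent - latent.mean(axis=0)       # center the latent vectors
    Xs = Xc / np.linalg.norm(Xc, axis=0)    # scale every column to unit length
    Ls = Lc / np.linalg.norm(Lc, axis=0)
    return Xs.T @ Ls                        # Pearson correlations, shape (p, 2)
```

Each coordinate lies in [-1, 1], and the points fall inside the unit circle when the two latent vectors are uncorrelated. Variables that are strongly correlated with the same latent vector land close together near the circle's edge, which is what makes the clusters in such a plot readable.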
