Sparse canonical methods for biological data integration: application to a cross-platform study

doi:10.1186/1471-2105-10-34

Comparative Study

Kim-Anh Lê Cao¹, Pascal G P Martin, Christèle Robert-Granié, Philippe Besse

PMID: 19171069
PMCID: PMC2640358
DOI: 10.1186/1471-2105-10-34

Kim-Anh Lê Cao et al. BMC Bioinformatics. 2009 Jan 26;10:34.

Affiliation

¹ Station d'Amélioration Génétique des Animaux UR 631, Institut National de Recherche Agronomique, F-31326 Castanet, France. k.lecao@imb.uq.edu.au

Abstract

Background: In the context of systems biology, few sparse approaches have been proposed so far to integrate several data sets. Yet this is an important and fundamental issue that will be widely encountered in post-genomic studies, when transcriptomics, proteomics and metabolomics data from different platforms are analyzed simultaneously to understand the mutual interactions between the data sets. In this high-dimensional setting, variable selection is crucial to obtain interpretable results. We focus on a sparse Partial Least Squares approach (sPLS) to handle two-block data sets, where the relationship between the two types of variables is known to be symmetric. Sparse PLS has been developed for either a regression or a canonical correlation framework and includes a built-in procedure to select variables while integrating data. To illustrate the canonical mode approach, we analyzed the NCI60 data sets, where two different platforms (cDNA and Affymetrix chips) were used to study the transcriptome of sixty cancer cell lines.

Results: We compare the results of sPLS with those of two other sparse or related canonical correlation approaches: CCA with Elastic Net penalization (CCA-EN) and Co-Inertia Analysis (CIA). The latter does not include a built-in procedure for variable selection and requires a two-step analysis. We stress the lack of statistical criteria to evaluate canonical correlation methods, which makes biological interpretation necessary to compare the different gene selections. We also propose comprehensive graphical representations of both samples and variables to facilitate the interpretation of the results.

Conclusion: sPLS and CCA-EN selected highly relevant genes and complementary findings from the two data sets, which enabled a detailed understanding of the molecular characteristics of several groups of cell lines. These two approaches gave similar results, although they highlighted the same phenomena with different priorities. They outperformed CIA, which tended to select redundant information.
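The built-in variable selection mentioned in the Background can be illustrated with a minimal sketch. It assumes that sparsity is obtained by soft-thresholding the loading vectors of the cross-product matrix between the two blocks, and it computes one dimension only. The function names, the penalties `lam_x` and `lam_y`, the naive starting vector and the toy data are all illustrative assumptions. This is not the authors' implementation, and it omits the deflation step that distinguishes the regression and canonical modes.

```python
import math

def center(X):
    """Subtract each column mean so cross-products behave like covariances."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def soft_threshold(vec, lam):
    """Shrink entries toward zero by lam; entries with |x| <= lam become 0."""
    return [(abs(x) - lam) * (1.0 if x > 0 else -1.0) if abs(x) > lam else 0.0
            for x in vec]

def normalize(vec):
    """Scale to unit Euclidean norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def spls_first_dimension(X, Y, lam_x, lam_y, max_iter=100, tol=1e-10):
    """Return sparse loadings (u, v) and latent variables (xi, omega)."""
    X, Y = center(X), center(Y)
    p, q = len(X[0]), len(Y[0])
    # M = X^T Y, the p x q cross-product matrix between the two blocks.
    M = [[sum(xr[j] * yr[k] for xr, yr in zip(X, Y)) for k in range(q)]
         for j in range(p)]
    u = [0.0] * p
    v = normalize([1.0] * q)  # naive start (assumption); an SVD of M is common
    for _ in range(max_iter):
        u_new = normalize(soft_threshold(
            [sum(M[j][k] * v[k] for k in range(q)) for j in range(p)], lam_x))
        v_new = normalize(soft_threshold(
            [sum(M[j][k] * u_new[j] for j in range(p)) for k in range(q)], lam_y))
        change = max(abs(a - b) for a, b in zip(u_new + v_new, u + v))
        u, v = u_new, v_new
        if change < tol:
            break
    xi = [sum(x * w for x, w in zip(row, u)) for row in X]      # X-block scores
    omega = [sum(y * w for y, w in zip(row, v)) for row in Y]   # Y-block scores
    return u, v, xi, omega

# Toy data (hypothetical): two informative and two noisy X variables,
# one informative and one noisy Y variable, six samples.
X = [[-2.5, -2.0, 0.1, 0.1], [-1.5, -1.2, -0.1, 0.1], [-0.5, -0.4, 0.1, -0.1],
     [0.5, 0.4, -0.1, -0.1], [1.5, 1.2, 0.1, 0.0], [2.5, 2.0, -0.1, 0.0]]
Y = [[-2.5, 0.1], [-1.5, -0.1], [-0.5, -0.1], [0.5, 0.1], [1.5, 0.0], [2.5, 0.0]]
u, v, xi, omega = spls_first_dimension(X, Y, lam_x=1.0, lam_y=1.0)
```

On this toy input the loadings of the noisy variables are shrunk to exactly zero, which is what makes the selected variables readable directly from `u` and `v`.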

Figures

**Figure 1**
**Rd**. Cumulative explained variance (Rd criterion) of each data set in relation to its component score (CCA-EN, CIA) or latent variable (sPLS).
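A minimal sketch of how such a cumulative curve could be computed, assuming the Rd criterion is the usual redundancy measure: the squared correlation between each variable and a component, averaged over the variables and accumulated over successive components. The function names and the toy values are illustrative assumptions, not taken from the article.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def cumulative_rd(variables, components):
    """variables: list of columns (one list per variable, indexed by sample).
    components: list of score vectors, in order of extraction.
    Returns the cumulative average squared correlation after each component."""
    p = len(variables)
    total, curve = 0.0, []
    for comp in components:
        total += sum(pearson(col, comp) ** 2 for col in variables) / p
        curve.append(total)
    return curve

# Hypothetical data: three variables measured on four samples, two components.
variables = [[1, 2, 3, 4], [4, 3, 2, 1], [1, -1, -1, 1]]
components = [[1, 2, 3, 4], [1, -1, -1, 1]]
rd_curve = cumulative_rd(variables, components)
```

In this toy case the first component is perfectly correlated (in absolute value) with two of the three variables and the second with the remaining one, so the curve rises from 2/3 to 1.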

**Figure 2**
**Hierarchical clustering of the two data sets using all expression profiles**. Hierarchical clustering of the cell lines with Ward method and correlation distance using the expression profiles from the Ross (left) and Staunton (right) data sets. The tissues of origin of the cell lines are coded as BR = Breast, CNS = Central Nervous System, CO = Colon, LE = Leukaemia, ME = Melanoma, LU = Lung, OV = Ovarian, PR = Prostate, RE = Renal. The Ward method maximizes the between-cluster inertia and minimizes the within-cluster inertia for each step of the clustering algorithm. Height represents the loss of between-cluster inertia for each clustering step. Dashed lines cut the dendrograms to highlight the three main clusters.

**Figure 3**
**Graphical representations of the samples using CCA-EN**. Graphical representations of the cell lines obtained by plotting the CCA-EN component scores for dimensions 1 and 2 **(a)** or 1 and 3 **(b)**. The component scores computed on each data set are superimposed: the start of each arrow shows the location of the Ross sample and the tip the location of the Staunton sample. Short arrows indicate that both data sets strongly agree. The colors indicate the tissues of origin of the cell lines with BR = Breast, CNS = Central Nervous System, CO = Colon, LE = Leukaemia, ME = Melanoma, LU = Lung, OV = Ovarian, PR = Prostate, RE = Renal.

**Figure 4**
**Graphical representations of the samples using sPLS**. Graphical representations of the cell lines obtained by plotting the sPLS latent variable vectors for dimensions 1 and 2 **(a)** or 1 and 3 **(b)**. The latent variable vectors computed on each data set are superimposed: the start of each arrow shows the location of the Ross sample and the tip the location of the Staunton sample. Short arrows indicate that both data sets strongly agree. The colors indicate the tissues of origin of the cell lines with BR = Breast, CNS = Central Nervous System, CO = Colon, LE = Leukaemia, ME = Melanoma, LU = Lung, OV = Ovarian, PR = Prostate, RE = Renal.
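The arrow representation used in Figures 3 and 4 can be summarised numerically. The sketch below assumes each cell line has a pair of coordinates per data set on comparable scales and measures agreement by the Euclidean length of the arrow. The sample labels and scores are hypothetical.

```python
import math

def arrow_lengths(start_scores, tip_scores):
    """Map each sample to the Euclidean length of its start-to-tip arrow."""
    lengths = {}
    for name, (x0, y0) in start_scores.items():
        x1, y1 = tip_scores[name]
        lengths[name] = math.hypot(x1 - x0, y1 - y0)
    return lengths

# Hypothetical two-dimensional scores for three cell lines.
ross = {"ME_a": (1.0, 2.0), "LE_b": (-1.0, 0.5), "CO_c": (0.0, -1.0)}
staunton = {"ME_a": (1.0, 2.5), "LE_b": (2.0, 4.5), "CO_c": (0.0, -1.0)}

lengths = arrow_lengths(ross, staunton)
# Samples ordered from strongest agreement (shortest arrow) to weakest.
ranked = sorted(lengths, key=lengths.get)
```

Sorting the samples by arrow length gives a quick list of the cell lines on which the two platforms disagree most.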

**Figure 5**
**Graphical representations of the variables selected by sPLS, Ross data set**. Example of graphical representation of the genes selected on the first two sPLS dimensions. The coordinates of each gene are obtained by computing the correlation between the latent variable vectors (ξ₁, ξ₂) and the original Ross data set. The selected cDNAs are then projected onto correlation circles where highly correlated cDNAs cluster together. These graphics help identify correlated genes between the two platforms (by superimposing the graphics from Figures 5 and 6). Combined with the information contained in Figure 4, they also make it possible to associate gene clusters with a type of tumor cell line. The labels of the cDNAs can be plotted interactively in R to facilitate their identification. Subsets of the selected genes may also be displayed alone to focus on specific, user-defined gene groups.

**Figure 6**
**Graphical representations of the variables selected by sPLS, Staunton data set**. Example of graphical representation of the genes selected on the first two sPLS dimensions. The coordinates of each gene are obtained by computing the correlation between the latent variable vectors (ω₁, ω₂) and the original Staunton data set. The selected Affymetrix probes are then projected onto correlation circles where highly correlated probes cluster together. These graphics help identify correlated genes between the two platforms (by superimposing the graphics from Figures 5 and 6). Combined with the information contained in Figure 4, they also make it possible to associate gene clusters with a type of tumor cell line. The labels of the Affymetrix probes can be plotted interactively in R to facilitate their identification. Subsets of the selected genes may also be displayed alone to focus on specific, user-defined gene groups.
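The coordinates described in the captions of Figures 5 and 6 can be sketched as follows: each selected variable is placed at the pair of its correlations with the first two latent variable vectors. The gene names, expression values and latent vectors below are hypothetical, and the helper names are illustrative.

```python
import math

def standardize(values):
    """Center a vector and scale it to unit Euclidean norm."""
    mean = sum(values) / len(values)
    centred = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centred))
    if norm == 0:
        raise ValueError("constant vector has no defined correlation")
    return [c / norm for c in centred]

def correlation(a, b):
    """Pearson correlation as the dot product of standardized vectors."""
    return sum(x * y for x, y in zip(standardize(a), standardize(b)))

def circle_coordinates(variables, latent_1, latent_2):
    """Map each variable name to (cor with latent_1, cor with latent_2)."""
    return {name: (correlation(values, latent_1), correlation(values, latent_2))
            for name, values in variables.items()}

# Hypothetical latent variable vectors (four samples) and selected genes.
latent_1 = [-1.0, -1.0, 1.0, 1.0]
latent_2 = [-1.0, 1.0, -1.0, 1.0]
genes = {
    "gene_a": [-2.0, -2.0, 2.0, 2.0],   # follows the first dimension only
    "gene_b": [1.0, -1.0, 1.0, -1.0],   # opposed to the second dimension
    "gene_c": [-2.0, 0.0, 0.0, 2.0],    # mixes both dimensions
}
coords = circle_coordinates(genes, latent_1, latent_2)
```

Because the two toy latent vectors are uncorrelated, every point falls inside or on the unit circle, and variables close to each other near the circle are strongly correlated.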

**Figure 7**
**Venn Diagrams**. Venn diagrams for 100 selected genes associated with melanoma vs. the other cell lines for each data set (top). These lists were then decomposed into up- and down-regulated genes (bottom).
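The counts behind such diagrams reduce to set operations. The sketch below assumes the three sets being compared are the gene lists selected by CCA-EN, CIA and sPLS on one data set, which the caption does not state explicitly. The gene identifiers and regulation labels are hypothetical.

```python
def venn3_regions(a, b, c):
    """Return the size of each of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b": len((a & b) - c),
        "a_c": len((a & c) - b),
        "b_c": len((b & c) - a),
        "a_b_c": len(a & b & c),
    }

def split_by_direction(genes, direction, label):
    """Keep only the genes whose regulation label matches (e.g. 'up')."""
    return {g for g in genes if direction[g] == label}

# Hypothetical selections from three methods and regulation labels.
cca_en = {"g1", "g2", "g3", "g4"}
cia = {"g3", "g4", "g5"}
spls = {"g2", "g3", "g6"}
direction = {"g1": "up", "g2": "down", "g3": "up",
             "g4": "up", "g5": "down", "g6": "down"}

all_regions = venn3_regions(cca_en, cia, spls)
up_regions = venn3_regions(*(split_by_direction(s, direction, "up")
                             for s in (cca_en, cia, spls)))
```

The seven region sizes sum to the number of distinct genes selected by at least one method, which is a convenient sanity check before drawing the diagram.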

Cited by

Network dynamics and therapeutic aspects of mRNA and protein markers with the recurrence sites of pancreatic cancer.
Acharjee A, Okyere D, Nath D, Nagar S, Gkoutos GV. Heliyon. 2024 May 17;10(10):e31437. doi: 10.1016/j.heliyon.2024.e31437. PMID: 38803850.
Chemometric analysis illuminates the relationship among browning, polyphenol degradation, Maillard reaction and flavor variation of 5 jujube fruits during air-impingement jet drying.
Li W, Liang C, Bao F, Zhang T, Cheng Y, Zhang W, Lu Y. Food Chem X. 2024 Apr 28;22:101425. doi: 10.1016/j.fochx.2024.101425. PMID: 38736979.
Diet-omics in the Study of Urban and Rural Crohn disease Evolution (SOURCE) cohort.
Braun T, Feng R, Amir A, Levhar N, Shacham H, Mao R, Hadar R, Toren I, Algavi Y, Abu-Saad K, Zhuo S, Efroni G, Malik A, Picard O, Yavzori M, Agranovich B, Liu TC, Stappenbeck TS, Denson L, Kalter-Leibovici O, Gottlieb E, Borenstein E, Elinav E, Chen M, Ben-Horin S, Haberman Y. Nat Commun. 2024 May 4;15(1):3764. doi: 10.1038/s41467-024-48106-6. PMID: 38704361.
Integrative Analysis of Differentially Expressed Genes in Time-Course Multi-Omics Data with MINT-DE.
Xue H, Delbare SYN, Wells MT, Basu S, Clark AG. Res Sq [Preprint]. 2024 Jan 1:rs.3.rs-3806701. doi: 10.21203/rs.3.rs-3806701/v1. PMID: 38260696.
Rapid intestinal and systemic metabolic reprogramming in an immunosuppressed environment.
Ma B, Gavzy SJ, France M, Song Y, Lwin HW, Kensiski A, Saxena V, Piao W, Lakhan R, Iyyathurai J, Li L, Paluskievicz C, Wu L, WillsonShirkey M, Mongodin EF, Mas VR, Bromberg JS. BMC Microbiol. 2023 Dec 9;23(1):394. doi: 10.1186/s12866-023-03141-z. PMID: 38066426.
