TY - JOUR
T1 - Statistical integration of two omics datasets using GO2PLS
AU - Gu, Zhujie
AU - El Bouhaddani, Said
AU - Pei, Jiayi
AU - Houwing-Duistermaat, Jeanine
AU - Uh, Hae-Won
N1 - Funding Information:
The authors would like to thank M. Harakalova, and M. Mokry from the Dept. of Cardiology, UMC Utrecht for providing the CVON-DOSIS data and discussion on the analysis of the CVON-DOSIS datasets. We thank M. Michels and J. van der Velden for providing the HCM tissues, the biobank of UMC Utrecht, the biobank of the Washington University School of Medicine, and the Sydney Heart Bank for providing non-failing donor tissue. This work has received support from the EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking BigData@Heart grant (116074). TwinsUK is funded by the Wellcome Trust, Medical Research Council, European Union, Chronic Disease Research Foundation (CDRF), Zoe Global Ltd and the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London.
Funding Information:
The research leading to these results has received funding and support from the European Union’s Horizon 2020 research and innovation programme IMforFUTURE under H2020-MSCA-ITN grant agreement number 721815, from the EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking BigData@Heart grant (116074), and from the ERA-Net for Research Programmes on Rare Diseases (E-rare 3 – MSAomics project). The funding bodies did not play any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Publisher Copyright:
© 2021, The Author(s).
PY - 2021/12
Y1 - 2021/12
N2 - BACKGROUND: Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various methods have been proposed, such as Partial Least Squares (PLS), decomposing two datasets into joint and residual subspaces. Since omics data are heterogeneous, the joint components in PLS will contain variation specific to each dataset. To account for this, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces and better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, while it might be of interest to identify a small subset relevant to the research question. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS) that utilizes biological information on group structures among variables and performs group selection in the joint subspace. RESULTS: The simulation study showed that introducing sparsity improved the feature selection performance. Furthermore, incorporating group structures increased robustness of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to datasets from two studies: TwinsUK (a population study) and CVON-DOSIS (a small case-control study). In the first, we incorporated biological information on the group structures of the methylation CpG sites when integrating the methylation dataset with the IgG glycomics data. The targeted genes of the selected methylation groups turned out to be relevant to the immune system, in which the IgG glycans play important roles. In the second, we selected regulatory regions and transcripts that explained the covariance between regulomics and transcriptomics data. The corresponding genes of the selected features appeared to be relevant to heart muscle disease. CONCLUSIONS: GO2PLS integrates two omics datasets to help understand the underlying system that involves both omics levels. It incorporates external group information and performs group selection, resulting in a small subset of features that best explain the relationship between two omics datasets for better interpretability.
AB - BACKGROUND: Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various methods have been proposed, such as Partial Least Squares (PLS), decomposing two datasets into joint and residual subspaces. Since omics data are heterogeneous, the joint components in PLS will contain variation specific to each dataset. To account for this, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces and better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, while it might be of interest to identify a small subset relevant to the research question. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS) that utilizes biological information on group structures among variables and performs group selection in the joint subspace. RESULTS: The simulation study showed that introducing sparsity improved the feature selection performance. Furthermore, incorporating group structures increased robustness of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. We applied GO2PLS to datasets from two studies: TwinsUK (a population study) and CVON-DOSIS (a small case-control study). In the first, we incorporated biological information on the group structures of the methylation CpG sites when integrating the methylation dataset with the IgG glycomics data. The targeted genes of the selected methylation groups turned out to be relevant to the immune system, in which the IgG glycans play important roles. In the second, we selected regulatory regions and transcripts that explained the covariance between regulomics and transcriptomics data. The corresponding genes of the selected features appeared to be relevant to heart muscle disease. CONCLUSIONS: GO2PLS integrates two omics datasets to help understand the underlying system that involves both omics levels. It incorporates external group information and performs group selection, resulting in a small subset of features that best explain the relationship between two omics datasets for better interpretability.
KW - Dimension reduction
KW - Feature selection
KW - Group structure
KW - Integration of Omics data
KW - O2PLS
UR - http://www.scopus.com/inward/record.url?scp=85102797281&partnerID=8YFLogxK
U2 - 10.1186/s12859-021-03958-3
DO - 10.1186/s12859-021-03958-3
M3 - Article
C2 - 33736604
SN - 1471-2105
VL - 22
SP - 1
EP - 18
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
M1 - 131
ER -