TY - JOUR
T1 - Model selection for metabolomics
T2 - predicting diagnosis of coronary artery disease using automated machine learning
AU - Orlenko, Alena
AU - Kofink, Daniel
AU - Lyytikäinen, Leo-Pekka
AU - Nikus, Kjell
AU - Mishra, Pashupati
AU - Kuukasjärvi, Pekka
AU - Karhunen, Pekka J
AU - Kähönen, Mika
AU - Laurikka, Jari O
AU - Lehtimäki, Terho
AU - Asselbergs, Folkert W
AU - Moore, Jason H
N1 - Funding Information:
This work was supported by grant [R01 LM010098] from the National Institutes of Health (USA).
Publisher Copyright:
© The Author(s) 2019.
Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2020/3/1
Y1 - 2020/3/1
N2 - Motivation: Selecting the optimal machine learning (ML) model for a given dataset is often challenging. Automated ML (AutoML) has emerged as a powerful tool for enabling the automatic selection of ML methods and parameter settings for the prediction of biomedical endpoints. Here, we apply the tree-based pipeline optimization tool (TPOT) to predict angiographic diagnoses of coronary artery disease (CAD). With TPOT, ML models are represented as expression trees and optimal pipelines are discovered using a stochastic search method called genetic programming. We provide some guidelines for TPOT-based ML pipeline selection and optimization based on various clinical phenotypes and high-throughput metabolic profiles in the Angiography and Genes Study (ANGES). Results: We analyzed nuclear magnetic resonance-derived lipoprotein and metabolite profiles in the ANGES cohort with the goal of identifying the role of non-obstructive CAD patients in CAD diagnostics. We performed a comparative analysis of TPOT-generated ML pipelines with selected ML classifiers, optimized with a grid search approach, applied to two phenotypic CAD profiles. As a result, TPOT generated ML pipelines that outperformed grid search-optimized models across multiple performance metrics, including balanced accuracy and area under the precision-recall curve. With the selected models, we demonstrated that the phenotypic profile that distinguishes non-obstructive CAD patients from patients without CAD is associated with higher precision, suggesting a discrepancy in the underlying processes between these phenotypes.
KW - Coronary Artery Disease
KW - Humans
KW - Machine Learning
KW - Metabolome
KW - Metabolomics
UR - http://www.scopus.com/inward/record.url?scp=85082146175&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btz796
DO - 10.1093/bioinformatics/btz796
M3 - Article
C2 - 31702773
SN - 1367-4811
VL - 36
SP - 1772
EP - 1778
JO - Bioinformatics (Oxford, England)
JF - Bioinformatics (Oxford, England)
IS - 6
ER -