TY - JOUR
T1 - A workflow for missing values imputation of untargeted metabolomics data
AU - Faquih, Tariq
AU - van Smeden, Maarten
AU - Luo, Jiao
AU - Le Cessie, Saskia
AU - Kastenmüller, Gabi
AU - Krumsiek, Jan
AU - Noordam, Raymond
AU - van Heemst, Diana
AU - Rosendaal, Frits R.
AU - van Hylckama Vlieg, Astrid
AU - Willems van Dijk, Ko
AU - Mook-Kanamori, Dennis O.
N1 - Funding Information:
The NEO study is supported by the participating Departments, the Division and the Board of Directors of the Leiden University Medical Centre, and by the Leiden University, Research Profile Area ‘Vascular and Regenerative Medicine’. The analyses of metabolites are funded by the VENI grant (ZonMW-VENI Grant 916.14.023) of D.O.M.-K. D.v.H. and R.N. were supported by a grant of the VELUX Stiftung [grant number 1156]. J.L. was supported by the China Scholarship Council [No. 201808500155]. T.F. was supported by the King Abdullah Scholarship Program and King Faisal Specialist Hospital & Research Center [No. 1012879283].
Publisher Copyright:
© 2020 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2020/12
Y1 - 2020/12
AB - Metabolomics studies have seen a steady growth due to the development and implementation of affordable and high-quality metabolomics platforms. In large metabolite panels, measurement values are frequently missing and, if neglected or sub-optimally imputed, can cause biased study results. We provided a publicly available, user-friendly R script to streamline the imputation of missing endogenous, unannotated, and xenobiotic metabolites. We evaluated the multivariate imputation by chained equations (MICE) and k-nearest neighbors (kNN) analyses implemented in our script by simulations using measured metabolites data from the Netherlands Epidemiology of Obesity (NEO) study (n = 599). We simulated missing values in four unique metabolites from different pathways with different correlation structures in three sample sizes (599, 150, 50) with three missing percentages (15%, 30%, 60%), and using two missing mechanisms (completely at random and not at random). Based on the simulations, we found that for MICE, larger sample size was the primary factor decreasing bias and error. For kNN, the primary factor reducing bias and error was the metabolite correlation with its predictor metabolites. MICE provided consistently higher performance measures particularly for larger datasets (n > 50). In conclusion, we presented an imputation workflow in a publicly available R script to impute untargeted metabolomics data. Our simulations provided insight into the effects of sample size, percentage missing, and correlation structure on the accuracy of the two imputation methods.
KW - Imputation
KW - K-nearest neighbors
KW - Metabolon
KW - Multiple imputation using chained equations
KW - Simulation
KW - Untargeted metabolomics
KW - Workflow
UR - http://www.scopus.com/inward/record.url?scp=85098159619&partnerID=8YFLogxK
U2 - 10.3390/metabo10120486
DO - 10.3390/metabo10120486
M3 - Article
AN - SCOPUS:85098159619
SN - 2218-1989
VL - 10
SP - 1
EP - 23
JO - Metabolites
JF - Metabolites
IS - 12
M1 - 486
ER -