TY - JOUR
T1 - Developing Clinical Prediction Models Using Primary Care Electronic Health Record Data
T2 - The Impact of Data Preparation Choices on Model Performance
AU - van Os, Hendrikus J.A.
AU - Kanning, Jos P.
AU - Wermer, Marieke J.H.
AU - Chavannes, Niels H.
AU - Numans, Mattijs E.
AU - Ruigrok, Ynte M.
AU - van Zwet, Erik W.
AU - Putter, Hein
AU - Steyerberg, Ewout W.
AU - Groenwold, Rolf H.H.
N1 - Publisher Copyright: Copyright © 2022 van Os, Kanning, Wermer, Chavannes, Numans, Ruigrok, van Zwet, Putter, Steyerberg and Groenwold.
PY - 2022
Y1 - 2022
N2 - Objective: To quantify prediction model performance in relation to data preparation choices when using electronic health records (EHR). Study Design and Setting: Cox proportional hazards models were developed for predicting the first-ever main adverse cardiovascular events using Dutch primary care EHR data. The reference model was based on a 1-year run-in period, cardiovascular events were defined based on both EHR diagnosis and medication codes, and missing values were multiply imputed. We compared data preparation choices based on (i) length of the run-in period (2- or 3-year run-in); (ii) outcome definition (EHR diagnosis codes or medication codes only); and (iii) methods addressing missing values (mean imputation or complete case analysis) by making variations on the derivation set and testing their impact in a validation set. Results: We included 89,491 patients in whom 6,736 first-ever main adverse cardiovascular events occurred during a median follow-up of 8 years. Outcome definition based only on diagnosis codes led to a systematic underestimation of risk (calibration curve intercept: 0.84; 95% CI: 0.83–0.84), while complete case analysis led to overestimation (calibration curve intercept: −0.52; 95% CI: −0.53 to −0.51). Differences in the length of the run-in period showed no relevant impact on calibration and discrimination. Conclusion: Data preparation choices regarding outcome definition or methods to address missing values can have a substantial impact on the calibration of predictions, hampering reliable clinical decision support. This study further illustrates the urgency of transparent reporting of modeling choices in an EHR data setting.
KW - clinical prediction model
KW - data preparation
KW - electronic health records (EHRs)
KW - model performance
KW - model transportability
KW - prediction model
UR - http://www.scopus.com/inward/record.url?scp=85181000145&partnerID=8YFLogxK
U2 - 10.3389/fepid.2022.871630
DO - 10.3389/fepid.2022.871630
M3 - Article
AN - SCOPUS:85181000145
VL - 2
JO - Frontiers in Epidemiology
JF - Frontiers in Epidemiology
M1 - 871630
ER -