TY - JOUR
T1 - Developing more generalizable prediction models from pooled studies and large clustered data sets
AU - de Jong, Valentijn M.T.
AU - Moons, Karel G.M.
AU - Eijkemans, Marinus J.C.
AU - Riley, Richard D.
AU - Debray, Thomas P.A.
N1 - Funding Information:
The authors thank the anonymous peer reviewers and editor for their thoughtful comments that have improved this manuscript. The authors gratefully acknowledge the following authors for sharing of individual participant data from the deep vein thrombosis (DVT) studies: G. J. Geersing, N. P. A. Zuithoff, C. Kearon, D. R. Anderson, A. J. ten Cate‐Hoek, J. L. Elf, S. M. Bates, A. W. Hoes, R. A. Kraaijenhagen, R. Oudega, R. E. G. Schutgens, S. M. Stevens, S. C. Woller, P. S. Wells, and K. G. M. Moons. This project is financially supported by the Netherlands Organization for Health Research and Development grant 91617050. This project has received funding from the European Union's Horizon 2020 research and innovation programme under ReCoDID grant agreement No 825746.
Funding Information: Directorate-General for Research and Innovation, 825746; ZonMw, 91617050
Publisher Copyright:
© 2021 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
PY - 2021/7/10
Y1 - 2021/7/10
AB - Prediction models often yield inaccurate predictions for new individuals. Large data sets from pooled studies or electronic healthcare records may alleviate this with an increased sample size and variability in sample characteristics. However, existing strategies for prediction model development generally do not account for heterogeneity in predictor-outcome associations between different settings and populations. This limits the generalizability of developed models (even from large, combined, clustered data sets) and necessitates local revisions. We aim to develop methodology for producing prediction models that require less tailoring to different settings and populations. We adopt internal-external cross-validation to assess and reduce heterogeneity in models' predictive performance during the development. We propose a predictor selection algorithm that optimizes the (weighted) average performance while minimizing its variability across the hold-out clusters (or studies). Predictors are added iteratively until the estimated generalizability is optimized. We illustrate this by developing a model for predicting the risk of atrial fibrillation and updating an existing one for diagnosing deep vein thrombosis, using individual participant data from 20 cohorts (N = 10 873) and 11 diagnostic studies (N = 10 014), respectively. Meta-analysis of calibration and discrimination performance in each hold-out cluster shows that trade-offs between average and heterogeneity of performance occurred. Our methodology enables the assessment of heterogeneity of prediction model performance during model development in multiple or clustered data sets, thereby informing researchers on predictor selection to improve the generalizability to different settings and populations, and reduce the need for model tailoring. Our methodology has been implemented in the R package metamisc.
KW - Calibration
KW - Humans
KW - Research Design
KW - heterogeneity
KW - individual participant data
KW - internal-external cross-validation
KW - prediction
UR - http://www.scopus.com/inward/record.url?scp=85105060785&partnerID=8YFLogxK
U2 - 10.1002/sim.8981
DO - 10.1002/sim.8981
M3 - Article
C2 - 33948970
AN - SCOPUS:85105060785
SN - 0277-6715
VL - 40
SP - 3533
EP - 3559
JO - Statistics in Medicine
JF - Statistics in Medicine
IS - 15
ER -