Plug-and-play use of tree-based methods: Consequences for clinical prediction modelling

Lotta M Meijerink*, Ewoud Schuit, Karel G M Moons, Tuur Leeuwenberg

*Corresponding author for this work

Research output: Contribution to journal, Article, Academic, peer-review

Abstract

Objectives: Tree-based models, such as random forest and XGBoost, are increasingly being used for clinical prediction, but certain aspects of their behavior are often overlooked. This article aims to illustrate these aspects and discuss the implications of plug-and-play use of tree-based models for clinical prediction. We focus on their ability to learn smooth, monotonic (ie, a consistent predictor effect where an increase in the predictor leads to an increase in predicted risk), and additive predictor-outcome associations (ie, each predictor independently and additively contributes to the outcome), and on how they behave when making predictions outside the range of observed data (extrapolation).

Study Design and Setting: We illustrated the behavior of plug-and-play use of tree-based models in a simulation study in which we sampled predictors from standard normal distributions and binary outcomes determined by the logistic function of the predictors, and translated this into potential clinical implications in a real-world clinical example of post-radiotherapy toxicity prediction. To show the generalizability of our findings, we also assessed the models' behavior in a publicly available dataset of patients with head and neck cancer. For each analysis, we visualized the learned predictor-outcome associations across different sample sizes.

Results: In the simulation study, the models showed stepwise fluctuations in their learned continuous predictor-outcome associations, which are caused by the inherent categorization of continuous predictors in a decision tree. Even with large sample sizes, the associations were not smooth or monotonic. Furthermore, because tree-based models can only split orthogonally to the axes, they struggle to learn additive effects. Additionally, tree-based models extrapolate in a somewhat unintuitive way, by predicting a constant value beyond the observed data, regardless of further increases in predictor values. Using the clinical example and case study, we highlight that the learned associations are biologically implausible and may lead to issues regarding generalizability and trustworthiness.

Conclusion: Using tree-based models in a plug-and-play manner for clinical prediction may result in undesirable predictor-outcome associations. Therefore, we recommend carefully taking their behavior into account during modeling decisions and evaluations. Further research is needed to explore the potential value of recent developments in decision tree literature, such as using constraints to incorporate prior knowledge and using soft-split decision trees.

Plain Language Summary: In healthcare, statistical models can be used to predict a patient's risk of having or developing certain medical conditions. These predictions can be used for diagnosis, treatment planning, or informing patients. While there are many ways to build these models, a method that uses decision trees (and collections of these trees, like random forest and XGBoost) has become increasingly popular. In our work, we used simulations and real patient data to illustrate how these tree-based models learn from data. We show that these models inherently learn associations that are often biologically unrealistic, even when a lot of data are available. For example, when a predictive variable (eg, body mass index) increases, we would expect smooth, gradual changes in risk. Instead, tree-based models tend to make sudden, unrealistic jumps or unexpected drops in their predictions. To discuss why this is relevant, we examined a case study in radiotherapy, where doctors use models to predict treatment side effects based on variables such as the planned radiation dose. Here, the behavior of tree-based models sometimes leads to predictions where a tiny increase in radiation dose would either suddenly decrease the risk of side effects (which is not biologically plausible) or cause a dramatic increase in risk. Such biologically implausible predictions could result in a lack of trust from clinicians and potentially lead to suboptimal treatment decisions. While tree-based models are powerful methods, we argue that researchers should be aware of how these models learn and carefully evaluate whether this is acceptable for each specific clinical application. We also illustrate some promising solutions, such as forcing the models to incorporate existing medical knowledge, for example, by requiring that higher radiation doses must result in higher risks of side effects. Our work helps researchers understand the limitations and potential of these increasingly popular tree-based prediction methods and supports their responsible use.
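
Illustrative sketch (not part of the published abstract): the stepwise associations and constant extrapolation described in the Results can be reproduced in a few lines of Python. This is a minimal sketch and not the authors' simulation code. It assumes a single standard-normal predictor, a binary outcome drawn from the logistic function of that predictor, and one hand-rolled regression tree with squared-error splits standing in for random forest or XGBoost. The names `fit_tree` and `predict`, the depth of 4, the minimum leaf size of 20, and the sample size of 2000 are all hypothetical choices made for illustration.

```python
import math
import random


def logistic(z):
    """True risk in this toy simulation: logistic function of the predictor."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_tree(xs, ys, depth, min_leaf=20):
    """Fit a one-predictor regression tree by greedy squared-error splitting.

    Returns a nested dict: a leaf {"value": mean outcome} or an internal
    node {"threshold": t, "left": subtree, "right": subtree}.
    """
    n = len(xs)
    total = sum(ys)
    if depth == 0 or n < 2 * min_leaf:
        return {"value": total / n}
    pairs = sorted(zip(xs, ys))
    sx = [p[0] for p in pairs]
    sy = [p[1] for p in pairs]
    # Maximizing sum_left^2/n_left + sum_right^2/n_right is equivalent to
    # minimizing the summed squared error of the two child means.
    best_score = total * total / n  # score of not splitting at all
    best_k = None
    left_sum = 0.0
    for k in range(1, n):
        left_sum += sy[k - 1]
        if k < min_leaf or n - k < min_leaf or sx[k] == sx[k - 1]:
            continue
        right_sum = total - left_sum
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if score > best_score + 1e-12:
            best_score, best_k = score, k
    if best_k is None:
        return {"value": total / n}
    return {
        "threshold": (sx[best_k - 1] + sx[best_k]) / 2.0,
        "left": fit_tree(sx[:best_k], sy[:best_k], depth - 1, min_leaf),
        "right": fit_tree(sx[best_k:], sy[best_k:], depth - 1, min_leaf),
    }


def predict(tree, x):
    """Walk from the root to a leaf and return that leaf's mean outcome."""
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]


# Simulate data: standard-normal predictor, logistic outcome model.
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
ys = [1 if random.random() < logistic(x) else 0 for x in xs]
tree = fit_tree(xs, ys, depth=4)

# Learned association on a fine grid: piecewise constant, one value per leaf.
grid = [-3.0 + 0.05 * i for i in range(121)]
risks = [predict(tree, x) for x in grid]
print("distinct predicted risks on the grid:", len(set(risks)))
print("monotonic non-decreasing:",
      all(a <= b for a, b in zip(risks, risks[1:])))

# Extrapolation: beyond the largest observed value the prediction is constant.
x_max = max(xs)
for x in (x_max, x_max + 1.0, x_max + 100.0):
    print(f"x = {x:8.2f}  predicted = {predict(tree, x):.3f}  "
          f"true = {logistic(x):.3f}")
```

A depth-4 tree has at most 16 leaves, so the learned association can take at most 16 distinct values no matter how fine the grid is. Every value beyond the observed maximum falls into the same rightmost leaf, so the prediction stays fixed while the true risk continues to rise. Whether the learned steps happen to be monotonic depends on the sampled data, which is why the script reports it rather than assuming it.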

Original language: English
Article number: 111834
Journal: Journal of Clinical Epidemiology
Volume: 184
Publication status: Published - Aug 2025

Keywords

  • Decision tree
  • Machine learning
  • Prediction models
  • Random Forest
  • Tree-based model
  • Trustworthiness
