TY - JOUR
T1 - Patient-specific uncertainty calibration of deep learning-based autosegmentation networks for adaptive MRI-guided lung radiotherapy
AU - Rabe, Moritz
AU - Meliadò, Ettore F.
AU - Marschner, Sebastian N.
AU - Belka, Claus
AU - Corradini, Stefanie
AU - van den Berg, Cornelis A.T.
AU - Landry, Guillaume
AU - Kurz, Christopher
N1 - Publisher Copyright: © 2025 The Author(s). Published on behalf of Institute of Physics and Engineering in Medicine by IOP Publishing Ltd.
PY - 2025/5/18
Y1 - 2025/5/18
N2 - Objective. Uncertainty assessment of deep learning autosegmentation (DLAS) models can support contour corrections in adaptive radiotherapy (ART), e.g. by utilizing Monte Carlo Dropout (MCD) uncertainty maps. However, poorly calibrated uncertainties at the patient level often render these clinically nonviable. We evaluated population-based and patient-specific DLAS accuracy and uncertainty calibration and propose a patient-specific post-training uncertainty calibration method for DLAS in ART. Approach. The study included 122 lung cancer patients treated with a low-field MR-linac (80/19/23 training/validation/test cases). Ten single-label 3D-U-Net population-based baseline models (BM) were trained with dropout using planning MRIs (pMRIs) and contours for nine organs-at-risk (OARs) and gross tumor volumes (GTVs). Patient-specific models (PS) were created by fine-tuning BMs with each test patient’s pMRI. Model uncertainty was assessed with MCD, averaged into probability maps. Uncertainty calibration was evaluated with reliability diagrams and expected calibration error (ECE). A proposed post-training calibration method rescaled MCD probabilities for fraction images in BM (calBM) and PS (calPS) after fitting reliability diagrams from pMRIs. All models were evaluated on fraction images using Dice similarity coefficient (DSC), 95th percentile Hausdorff distance (HD95) and ECE. Metrics were compared among models for all OARs combined (n = 163), and the GTV (n = 23), using Friedman and post hoc Nemenyi tests (α = 0.05). Main results. For the OARs, patient-specific fine-tuning significantly (p < 0.001) increased median DSC from 0.78 (BM) to 0.86 (PS) and reduced HD95 from 14 mm (BM) to 6.0 mm (PS). Uncertainty calibration achieved substantial reductions in ECE, from 0.25 (BM) to 0.091 (calBM) and 0.22 (PS) to 0.11 (calPS) (p < 0.001), without significantly affecting DSC or HD95 (p > 0.05). For the GTV, BM performance was poor (DSC = 0.05) but significantly (p < 0.001) improved with PS training (DSC = 0.75) while uncertainty calibration reduced ECE from 0.22 (PS) to 0.15 (calPS) (p = 0.45). Significance. Post-training uncertainty calibration yields geometrically accurate DLAS models with well-calibrated uncertainty estimates, crucial for ART applications.
AB - Objective. Uncertainty assessment of deep learning autosegmentation (DLAS) models can support contour corrections in adaptive radiotherapy (ART), e.g. by utilizing Monte Carlo Dropout (MCD) uncertainty maps. However, poorly calibrated uncertainties at the patient level often render these clinically nonviable. We evaluated population-based and patient-specific DLAS accuracy and uncertainty calibration and propose a patient-specific post-training uncertainty calibration method for DLAS in ART. Approach. The study included 122 lung cancer patients treated with a low-field MR-linac (80/19/23 training/validation/test cases). Ten single-label 3D-U-Net population-based baseline models (BM) were trained with dropout using planning MRIs (pMRIs) and contours for nine organs-at-risk (OARs) and gross tumor volumes (GTVs). Patient-specific models (PS) were created by fine-tuning BMs with each test patient’s pMRI. Model uncertainty was assessed with MCD, averaged into probability maps. Uncertainty calibration was evaluated with reliability diagrams and expected calibration error (ECE). A proposed post-training calibration method rescaled MCD probabilities for fraction images in BM (calBM) and PS (calPS) after fitting reliability diagrams from pMRIs. All models were evaluated on fraction images using Dice similarity coefficient (DSC), 95th percentile Hausdorff distance (HD95) and ECE. Metrics were compared among models for all OARs combined (n = 163), and the GTV (n = 23), using Friedman and post hoc Nemenyi tests (α = 0.05). Main results. For the OARs, patient-specific fine-tuning significantly (p < 0.001) increased median DSC from 0.78 (BM) to 0.86 (PS) and reduced HD95 from 14 mm (BM) to 6.0 mm (PS). Uncertainty calibration achieved substantial reductions in ECE, from 0.25 (BM) to 0.091 (calBM) and 0.22 (PS) to 0.11 (calPS) (p < 0.001), without significantly affecting DSC or HD95 (p > 0.05). For the GTV, BM performance was poor (DSC = 0.05) but significantly (p < 0.001) improved with PS training (DSC = 0.75) while uncertainty calibration reduced ECE from 0.22 (PS) to 0.15 (calPS) (p = 0.45). Significance. Post-training uncertainty calibration yields geometrically accurate DLAS models with well-calibrated uncertainty estimates, crucial for ART applications.
KW - adaptive radiotherapy
KW - autosegmentation
KW - deep learning
KW - epistemic uncertainty
KW - Monte Carlo dropout
KW - MR-linac
KW - uncertainty calibration
UR - http://www.scopus.com/inward/record.url?scp=105005552691&partnerID=8YFLogxK
U2 - 10.1088/1361-6560/add640
DO - 10.1088/1361-6560/add640
M3 - Article
C2 - 40340988
AN - SCOPUS:105005552691
SN - 0031-9155
VL - 70
JO - Physics in Medicine and Biology
JF - Physics in Medicine and Biology
IS - 10
M1 - 105018
ER -