TY - JOUR
T1 - An Order-Sensitive Hierarchical Neural Model for Early Lung Cancer Detection Using Dutch Primary Care Notes and Structured Data
AU - Vagliano, Iacopo
AU - Rios, Miguel
AU - Abukmeil, Mohanad
AU - Schut, Martijn C.
AU - Luik, Torec T.
AU - van Asselt, Kristel M.
AU - van Weert, Henk C.P.M.
AU - Abu-Hanna, Ameen
N1 - Publisher Copyright:
© 2025 by the authors.
PY - 2025/4
Y1 - 2025/4
N2 - Background: Improving prediction models to timely detect lung cancer is paramount. Our aim is to develop and validate prediction models for early detection of lung cancer in primary care, based on free-text consultation notes, that exploit the order and context among words and sentences. Methods: Data of all patients enlisted in 49 general practices between 2002 and 2021 were assessed, and we included those older than 30 years with at least one free-text note. We developed two models using a hierarchical architecture that relies on attention and bidirectional long short-term memory networks. One model used only text, while the other combined text with clinical variables. The models were trained on data excluding the five months leading up to the diagnosis, using target replication and a tuning set, and were tested on a separate dataset for discrimination, PPV, and calibration. Results: A total of 250,021 patients were enlisted, with 1507 having a lung cancer diagnosis. Included in the analysis were 183,012 patients, of which 712 had the diagnosis. From the two models, the combined model showed slightly better performance, achieving an AUROC on the test set of 0.91, an AUPRC of 0.05, and a PPV of 0.034 (0.024, 0.043), and showed good calibration. To early detect one cancer patient, 29 high-risk patients would require additional diagnostic testing. Conclusions: Our models showed excellent discrimination by leveraging the word and sentence structure. Including clinical variables in addition to text slightly improved performance. The number needed to treat holds promise for clinical practice. Investigating external validation and model suitability in clinical practice is warranted.
AB - Background: Improving prediction models to timely detect lung cancer is paramount. Our aim is to develop and validate prediction models for early detection of lung cancer in primary care, based on free-text consultation notes, that exploit the order and context among words and sentences. Methods: Data of all patients enlisted in 49 general practices between 2002 and 2021 were assessed, and we included those older than 30 years with at least one free-text note. We developed two models using a hierarchical architecture that relies on attention and bidirectional long short-term memory networks. One model used only text, while the other combined text with clinical variables. The models were trained on data excluding the five months leading up to the diagnosis, using target replication and a tuning set, and were tested on a separate dataset for discrimination, PPV, and calibration. Results: A total of 250,021 patients were enlisted, with 1507 having a lung cancer diagnosis. Included in the analysis were 183,012 patients, of which 712 had the diagnosis. From the two models, the combined model showed slightly better performance, achieving an AUROC on the test set of 0.91, an AUPRC of 0.05, and a PPV of 0.034 (0.024, 0.043), and showed good calibration. To early detect one cancer patient, 29 high-risk patients would require additional diagnostic testing. Conclusions: Our models showed excellent discrimination by leveraging the word and sentence structure. Including clinical variables in addition to text slightly improved performance. The number needed to treat holds promise for clinical practice. Investigating external validation and model suitability in clinical practice is warranted.
KW - early detection
KW - hierarchical attention network
KW - lung cancer
KW - machine learning
KW - natural language processing
KW - prediction models
KW - primary care
KW - word embeddings
UR - http://www.scopus.com/inward/record.url?scp=105002613574&partnerID=8YFLogxK
U2 - 10.3390/cancers17071151
DO - 10.3390/cancers17071151
M3 - Article
AN - SCOPUS:105002613574
SN - 2072-6694
VL - 17
JO - Cancers
JF - Cancers
IS - 7
M1 - 1151
ER -