TY - JOUR
T1 - Investigating De-Identification Methodologies in Dutch Medical Texts
T2 - A Replication Study of Deduce and Deidentify
AU - Mosteiro, Pablo
AU - Wang, Ruilin
AU - Scheepers, Floortje
AU - Spruit, Marco
N1 - Publisher Copyright: © 2025 by the authors.
PY - 2025/4
Y1 - 2025/4
N2 - Deidentifying sensitive information in electronic health records (EHRs) is increasingly important as legal obligations to data privacy evolve along with the need to protect patient and institutional confidentiality. This study aims to comparatively evaluate the performance of two state-of-the-art deidentification systems, Deduce and Deidentify, on both real-world and synthetic Dutch medical texts, thereby providing insights into their relative strengths and limitations in preserving privacy while maintaining data utility. We employ a replication-extension research design, utilizing two distinct datasets: (1) the Annotation-Based Dataset from the Utrecht University Medical Center (UMC Utrecht), comprising manually annotated patient records spanning 1987 to 2021, and (2) the Synthetic Dataset, generated using a two-step process involving OpenAI’s GPT-4 model. Utilizing precision, recall, and F1 scores as evaluation metrics, we uncover the relative strengths and limitations of the two methods. Our findings indicate that both techniques show variable performance across different entities of deidentifying text information. Deduce outperforms Deidentify in overall accuracy by a margin of 0.42 on the synthetic datasets. On the real-world annotation-based dataset, the generalization ability of Deidentify is lower than Deduce by 0.2. However, the performance of both techniques is affected by the limitations of the dataset. In conclusion, this study provides valuable insights into the comparative performance of Deduce and Deidentify for deidentifying Dutch EHRs, contributing to the development of more effective privacy preservation techniques in the healthcare domain.
AB - Deidentifying sensitive information in electronic health records (EHRs) is increasingly important as legal obligations to data privacy evolve along with the need to protect patient and institutional confidentiality. This study aims to comparatively evaluate the performance of two state-of-the-art deidentification systems, Deduce and Deidentify, on both real-world and synthetic Dutch medical texts, thereby providing insights into their relative strengths and limitations in preserving privacy while maintaining data utility. We employ a replication-extension research design, utilizing two distinct datasets: (1) the Annotation-Based Dataset from the Utrecht University Medical Center (UMC Utrecht), comprising manually annotated patient records spanning 1987 to 2021, and (2) the Synthetic Dataset, generated using a two-step process involving OpenAI’s GPT-4 model. Utilizing precision, recall, and F1 scores as evaluation metrics, we uncover the relative strengths and limitations of the two methods. Our findings indicate that both techniques show variable performance across different entities of deidentifying text information. Deduce outperforms Deidentify in overall accuracy by a margin of 0.42 on the synthetic datasets. On the real-world annotation-based dataset, the generalization ability of Deidentify is lower than Deduce by 0.2. However, the performance of both techniques is affected by the limitations of the dataset. In conclusion, this study provides valuable insights into the comparative performance of Deduce and Deidentify for deidentifying Dutch EHRs, contributing to the development of more effective privacy preservation techniques in the healthcare domain.
KW - deep learning methods
KW - Dutch medical records
KW - machine learning
KW - named entity recognition
KW - natural language processing
KW - privacy information
UR - http://www.scopus.com/inward/record.url?scp=105003649115&partnerID=8YFLogxK
U2 - 10.3390/electronics14081636
DO - 10.3390/electronics14081636
M3 - Article
AN - SCOPUS:105003649115
SN - 2079-9292
VL - 14
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 8
M1 - 1636
ER -