TY - JOUR
T1 - Haematology dimension reduction, a large scale application to regular care haematology data
AU - Joosse, Huibert-Jan
AU - Chumsaeng-Reijers, Chontira
AU - Huisman, Albert
AU - Hoefer, Imo E
AU - van Solinge, Wouter W
AU - Haitjema, Saskia
AU - van Es, Bram
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/2/12
Y1 - 2025/2/12
N2 - BACKGROUND: The routine diagnostic process increasingly entails the processing of high-volume and high-dimensional data that cannot be directly visualised. This processing may pose scaling issues that limit the implementation of these types of data into research as well as integrated diagnostics in routine care. Here, we investigate whether we can use existing dimension reduction techniques to provide visualisations and analyses for a complete blood count (CBC) while maintaining representativeness of the original data. We considered over 3 million CBC measurements encompassing over 70 parameters of cell frequency, size and complexity from the UMC Utrecht UPOD database. We evaluated PCA as an example of a linear dimension reduction technique and UMAP, TriMap and PaCMAP as non-linear dimension reduction techniques. We assessed their technical performance using quality metrics for dimension reduction as well as biological representation by evaluating preservation of diurnal, age and sex patterns, cluster preservation and the identification of leukemia patients. RESULTS: We found that, for clinical hematology data, PCA performs systematically better than UMAP, TriMap and PaCMAP in representing the underlying data. Biological relevance was retained for periodicity in the data. However, we also observed a decrease in predictive performance of the reduced data for both age and sex, as well as an overestimation of clusters within the reduced data. Finally, we were able to identify the diverging patterns for leukemia patients after use of dimensionality reduction methods. CONCLUSIONS: We conclude that for hematology data, the use of unsupervised dimension reduction techniques should be limited to data visualization applications, as implementing them in diagnostic pipelines may lead to decreased quality of integrated diagnostics in routine care.
AB - BACKGROUND: The routine diagnostic process increasingly entails the processing of high-volume and high-dimensional data that cannot be directly visualised. This processing may pose scaling issues that limit the implementation of these types of data into research as well as integrated diagnostics in routine care. Here, we investigate whether we can use existing dimension reduction techniques to provide visualisations and analyses for a complete blood count (CBC) while maintaining representativeness of the original data. We considered over 3 million CBC measurements encompassing over 70 parameters of cell frequency, size and complexity from the UMC Utrecht UPOD database. We evaluated PCA as an example of a linear dimension reduction technique and UMAP, TriMap and PaCMAP as non-linear dimension reduction techniques. We assessed their technical performance using quality metrics for dimension reduction as well as biological representation by evaluating preservation of diurnal, age and sex patterns, cluster preservation and the identification of leukemia patients. RESULTS: We found that, for clinical hematology data, PCA performs systematically better than UMAP, TriMap and PaCMAP in representing the underlying data. Biological relevance was retained for periodicity in the data. However, we also observed a decrease in predictive performance of the reduced data for both age and sex, as well as an overestimation of clusters within the reduced data. Finally, we were able to identify the diverging patterns for leukemia patients after use of dimensionality reduction methods. CONCLUSIONS: We conclude that for hematology data, the use of unsupervised dimension reduction techniques should be limited to data visualization applications, as implementing them in diagnostic pipelines may lead to decreased quality of integrated diagnostics in routine care.
KW - Clustering
KW - Data preservation
KW - Dimension reduction
KW - Haematology
KW - Routine care data
UR - http://www.scopus.com/inward/record.url?scp=85218478593&partnerID=8YFLogxK
U2 - 10.1186/s12911-025-02899-8
DO - 10.1186/s12911-025-02899-8
M3 - Article
C2 - 39939843
SN - 1472-6947
VL - 25
JO - BMC Medical Informatics and Decision Making
JF - BMC Medical Informatics and Decision Making
IS - 1
M1 - 75
ER -