TY - JOUR
T1 - Identifying Cases of Type 2 Diabetes in Heterogeneous Data Sources
T2 - Strategy from the EMIF Project
AU - Roberto, Giuseppe
AU - Leal, Ingrid
AU - Sattar, Naveed
AU - Loomis, A Katrina
AU - Avillach, Paul
AU - Egger, Peter
AU - van Wijngaarden, Rients
AU - Ansell, David
AU - Reisberg, Sulev
AU - Tammesoo, Mari-Liis
AU - Alavere, Helene
AU - Pasqua, Alessandro
AU - Pedersen, Lars
AU - Cunningham, James
AU - Tramontan, Lara
AU - Mayer, Miguel A
AU - Herings, Ron
AU - Coloma, Preciosa
AU - Lapi, Francesco
AU - Sturkenboom, Miriam
AU - van der Lei, Johan
AU - Schuemie, Martijn J
AU - Rijnbeek, Peter
AU - Gini, Rosa
PY - 2016
Y1 - 2016
N2 - Due to the heterogeneity of existing European sources of observational healthcare data, data source-tailored choices are needed to execute multi-data source, multi-national epidemiological studies. This makes transparent documentation paramount. In this proof-of-concept study, a novel standard data derivation procedure was tested in a set of heterogeneous data sources. Identification of subjects with type 2 diabetes (T2DM) was the test case. We included three primary care data sources (PCDs), three record linkage of administrative and/or registry data sources (RLDs), one hospital and one biobank. Overall, data from 12 million subjects from six European countries were extracted. Based on a shared event definition, sixteen standard algorithms (components) useful to identify T2DM cases were generated through a top-down/bottom-up iterative approach. Each component was based on one single data domain among diagnoses, drugs, diagnostic test utilization and laboratory results. Diagnoses-based components were subclassified considering the healthcare setting (primary, secondary, inpatient care). The Unified Medical Language System was used for semantic harmonization within data domains. Individual components were extracted and the proportion of the population identified was compared across data sources. Drug-based components performed similarly in RLDs and PCDs, unlike diagnoses-based components. Using components as building blocks, logical combinations with AND, OR, AND NOT were tested and local experts recommended their preferred data source-tailored combination. The population identified per data source by the resulting algorithms varied from 3.5% to 15.7%; however, age-specific results were fairly comparable. The impact of individual components was assessed: diagnoses-based components identified the majority of cases in PCDs (93-100%), while drug-based components were the main contributors in RLDs (81-100%).
The proposed data derivation procedure allowed the generation of data source-tailored case-finding algorithms in a standardized fashion, facilitated transparent documentation of the process and benchmarking of data sources, and provided bases for interpretation of possible inter-data source inconsistency of findings in future studies.
AB - Due to the heterogeneity of existing European sources of observational healthcare data, data source-tailored choices are needed to execute multi-data source, multi-national epidemiological studies. This makes transparent documentation paramount. In this proof-of-concept study, a novel standard data derivation procedure was tested in a set of heterogeneous data sources. Identification of subjects with type 2 diabetes (T2DM) was the test case. We included three primary care data sources (PCDs), three record linkage of administrative and/or registry data sources (RLDs), one hospital and one biobank. Overall, data from 12 million subjects from six European countries were extracted. Based on a shared event definition, sixteen standard algorithms (components) useful to identify T2DM cases were generated through a top-down/bottom-up iterative approach. Each component was based on one single data domain among diagnoses, drugs, diagnostic test utilization and laboratory results. Diagnoses-based components were subclassified considering the healthcare setting (primary, secondary, inpatient care). The Unified Medical Language System was used for semantic harmonization within data domains. Individual components were extracted and the proportion of the population identified was compared across data sources. Drug-based components performed similarly in RLDs and PCDs, unlike diagnoses-based components. Using components as building blocks, logical combinations with AND, OR, AND NOT were tested and local experts recommended their preferred data source-tailored combination. The population identified per data source by the resulting algorithms varied from 3.5% to 15.7%; however, age-specific results were fairly comparable. The impact of individual components was assessed: diagnoses-based components identified the majority of cases in PCDs (93-100%), while drug-based components were the main contributors in RLDs (81-100%).
The proposed data derivation procedure allowed the generation of data source-tailored case-finding algorithms in a standardized fashion, facilitated transparent documentation of the process and benchmarking of data sources, and provided bases for interpretation of possible inter-data source inconsistency of findings in future studies.
KW - Data Mining/methods
KW - Databases, Factual
KW - Diabetes Mellitus, Type 2/epidemiology
KW - Europe/epidemiology
KW - Female
KW - Humans
KW - Male
U2 - 10.1371/journal.pone.0160648
DO - 10.1371/journal.pone.0160648
M3 - Article
C2 - 27580049
SN - 1932-6203
VL - 11
SP - e0160648
JO - PLoS ONE
JF - PLoS ONE
IS - 8
ER -