Abstract
Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows the creation of software that, for a target variable, presents a choice of source variables from which a user can select the matching one, with only a low risk of having missed the correct source variable. Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less in 63% of all variables in the construction sample, and in 71% of all variables in the validation sample. Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
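The rule-combination idea in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: logic regression searches Boolean expressions far more flexibly, whereas this sketch simply enumerates single rules and pairwise AND/OR combinations by brute force. The candidate variables, their values, the three rules, and the target variable "age" are all hypothetical and chosen only for illustration. Combinations are ranked by sensitivity first and PPV second, reflecting the emphasis on sensitivity for semi-automation.

```python
from itertools import combinations

# Hypothetical source variables: (name, sample values, truly matches target "age").
CANDIDATES = [
    ("age", [54, 61, 999], True),           # 999 is an assumed missing-value code
    ("AGE_BL", [70, 66, 58], True),
    ("alter", [49, 63, 55], True),          # German for "age"
    ("age_menarche", [12, 13, 14], False),
    ("sbp", [120, 135, 142], False),
    ("page_no", [1, 2, 3], False),
    ("weight", [72, 65, 80], False),
]

# Hypothetical hand-written rules; each returns True if the variable looks like "age".
RULES = {
    "name_contains_age": lambda name, values: "age" in name.lower(),
    "name_starts_with_age": lambda name, values: name.lower().startswith("age"),
    "values_in_18_110": lambda name, values: all(18 <= v <= 110 for v in values),
}


def metrics(predicted, truth):
    """Sensitivity, PPV and NPV from two parallel lists of booleans."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(not p and t for p, t in zip(predicted, truth))
    tn = sum(not p and not t for p, t in zip(predicted, truth))

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def search(candidates, rules):
    """Brute-force stand-in for logic regression: try single rules and
    pairwise AND/OR combinations, keep the most sensitive one (ties: PPV)."""
    truth = [match for _, _, match in candidates]
    fired = {
        label: [rule(name, values) for name, values, _ in candidates]
        for label, rule in rules.items()
    }
    expressions = dict(fired)
    for a, b in combinations(fired, 2):
        expressions[f"({a} AND {b})"] = [x and y for x, y in zip(fired[a], fired[b])]
        expressions[f"({a} OR {b})"] = [x or y for x, y in zip(fired[a], fired[b])]
    scored = {expr: metrics(pred, truth) for expr, pred in expressions.items()}
    best = max(scored, key=lambda e: (scored[e]["sensitivity"], scored[e]["ppv"]))
    return best, scored[best]


best_expr, best_metrics = search(CANDIDATES, RULES)
print(best_expr, best_metrics)
```

On this toy data the selected expression is `(name_starts_with_age OR values_in_18_110)`, which catches every true match (sensitivity 1.0, NPV 1.0) at the cost of two false positives (PPV 0.6). This mirrors, in miniature, the trade-off between high NPV and low PPV reported in the Results.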
Original language | English |
---|---|
Article number | 40 |
Journal | BMC medical informatics and decision making [E] |
Volume | 17 |
Issue number | 1 |
DOIs | https://doi.org/10.1186/s12911-017-0429-1 |
Publication status | Published - 13 Apr 2017 |
Keywords
- Data management
- Epidemiology
- Logic regression
- Meta-analysis
In: BMC medical informatics and decision making [E], Vol. 17, No. 1, 40, 13.04.2017.
Research output: Contribution to journal › Article › Academic › peer-review
TY - JOUR
T1 - Automatic identification of variables in epidemiological datasets using logic regression
AU - Lorenz, Matthias W.
AU - Abdi, Negin Ashtiani
AU - Scheckenbach, Frank
AU - Pflug, Anja
AU - Bülbül, Alpaslan
AU - Catapano, Alberico L.
AU - Agewall, Stefan
AU - Ezhov, Marat
AU - Bots, Michiel L.
AU - Kiechl, Stefan
AU - Orth, Andreas
AU - Norata, Giuseppe D.
AU - Empana, Jean Philippe
AU - Lin, Hung Ju
AU - McLachlan, Stela
AU - Bokemark, Lena
AU - Ronkainen, Kimmo
AU - Amato, Mauro
AU - Schminke, Ulf
AU - Srinivasan, Sathanur R.
AU - Lind, Lars
AU - Kato, Akihiko
AU - Dimitriadis, Chrystosomos
AU - Przewlocki, Tadeusz
AU - Okazaki, Shuhei
AU - Stehouwer, C. D.A.
AU - Lazarevic, Tatjana
AU - Willeit, Peter
AU - Yanez, David N.
AU - Steinmetz, Helmuth
AU - Sander, Dirk
AU - Poppert, Holger
AU - Desvarieux, Moise
AU - Ikram, M. Arfan
AU - Bevc, Sebastjan
AU - Staub, Daniel
AU - Sirtori, Cesare R.
AU - Iglseder, Bernhard
AU - Engström, Gunnar
AU - Tripepi, Giovanni
AU - Beloqui, Oscar
AU - Lee, Moo Sik
AU - Friera, Alfonsa
AU - Xie, Wuxiang
AU - Grigore, Liliana
AU - Plichart, Matthieu
AU - Su, Ta Chen
AU - Robertson, Christine
AU - Schmidt, Caroline
AU - Tuomainen, Tomi Pekka
AU - Veglia, Fabrizio
AU - Völzke, Henry
AU - Nijpels, Giel
AU - Jovanovic, Aleksandar
AU - Willeit, Johann
AU - Sacco, Ralph L.
AU - Franco, Oscar H.
AU - Hojs, Radovan
AU - Uthoff, Heiko
AU - Hedblad, Bo
AU - Park, Hyun Woong
AU - Suarez, Carmen
AU - Zhao, Dong
AU - Catapano, Alberico
AU - Ducimetiere, Pierre
AU - Chien, Kuo Liong
AU - Price, Jackie F.
AU - Bergström, Göran
AU - Kauhanen, Jussi
AU - Tremoli, Elena
AU - Dörr, Marcus
AU - Berenson, Gerald
AU - Papagianni, Aikaterini
AU - Kablak-Ziembicka, Anna
AU - Kitagawa, Kazuo
AU - Dekker, Jaqueline M.
AU - Stolic, Radojica
AU - Polak, Joseph F.
AU - Sitzer, Matthias
AU - Bickel, Horst
AU - Rundek, Tatjana
AU - Hofman, Albert
AU - Ekart, Robert
AU - Frauchiger, Beat
AU - Castelnuovo, Samuela
AU - Rosvall, Maria
AU - Zoccali, Carmine
AU - Landecho, Manuel F.
AU - Bae, Jang Ho
AU - Gabriel, Rafael
AU - Liu, Jing
AU - Baldassarre, Damiano
AU - Kavousi, Maryam
N1 - Funding Information: We thank Ingo Ruczinski, Charles Kooperberg, and Michael LeBlanc at the Fred Hutchinson Cancer Research Center in Seattle for providing the public license CRAN software package, and the related documentation. This manuscript was prepared using a limited access dataset of the Atherosclerosis Risk In Communities (ARIC) study, obtained from the National Heart, Lung and Blood Institute (NHLBI). The ARIC study is conducted and supported by NHLBI in collaboration with the ARIC Study investigators. This manuscript does not necessarily reflect the opinions or views of the ARIC study or the NHLBI. The Bruneck study was supported by the Pustertaler Verein zur Praevention von Herz-und Hirngefaesserkrankungen, Gesundheitsbezirk Bruneck, and the Assessorat fuer Gesundheit, Province of Bolzano, Italy. The Carotid Atherosclerosis Progression Study (CAPS) was supported by the Stiftung Deutsche Schlaganfall-Hilfe. The PLIC Study is supported by a grant from SISA Sezione Regionale Lombarda. This manuscript was prepared using data from the Cardiovascular Health Study (CHS). The research reported in this article was supported by contracts N01-HC-85079 through N01-HC-85086, N01-HC-35129, N01 HC-15103, N01 HC-55222, and U01 HL080295 from the National Heart, Lung, and Blood Institute, with additional contribution from the National Institute of Neurological Disorders and Stroke. A full list of participating CHS investigators and institutions can be found at http://www.chs-nhlbi.org. The EVA Study was organized under an agreement between INSERM and the Merck, Sharp, and Dohme-Chibret Company. The Edinburgh Artery Study (EAS) was funded by the British Heart Foundation.
The IMPROVE study was supported by the European Commission (Contract number: QLG1-CT-2002-00896), Ministero della Salute Ricerca Corrente, Italy, the Swedish Heart-Lung Foundation, the Swedish Research Council (projects 8691 and 0593), the Foundation for Strategic Research, the Stockholm County Council (project 562183), the Foundation for Strategic Research, the Academy of Finland (Grant #110413) and the British Heart Foundation (RG2008/014). The INVADE study was supported by the AOK Bayern. This manuscript was prepared using data from the Northern Manhattan Study (NOMAS) and the Oral Infections, Carotid Atherosclerosis and Stroke (INVEST) study. The NOMAS is funded by the National Institute of Neurological Disorders and Stroke (NINDS) grant R37 NS 029993 and INVEST by the National Institute of Dental and Craniofacial Research (NIDCR) grant R01 DE 13094. The Rotterdam Study was supported by the Netherlands Foundation for Scientific Research (NWO), ZonMw, Vici 918-76-619. The Study of Health in Pomerania (SHIP; http://ship.community-medicine.de) is part of the Community Medicine Research net (CMR) of the University of Greifswald, Germany. Collaborators within the PROG-IMT study group: Giuseppe D. Norata, PhD1,2, Jean Philippe Empana, MD, PhD3, Hung-Ju Lin, MD4, Stela McLachlan, PhD5, Lena Bokemark, MD, PhD6, Kimmo Ronkainen, MSc7, Mauro Amato, PhD8, Ulf Schminke, MD, Prof9, Sathanur R. Srinivasan, PhD, Prof.10, Lars Lind, MD, PhD, Prof11, Akihiko Kato, MD, Prof.12, Chrystosomos Dimitriadis, MD13, Tadeusz Przewlocki, MD, PhD, Prof.14, Shuhei Okazaki, MD15, CDA Stehouwer, MD, PhD, FESC16, Tatjana Lazarevic, MA17, Peter Willeit, PhD18,19, David N. Yanez, PhD, Assoc. Prof20, Helmuth Steinmetz, MD, Prof21, Dirk Sander, MD, Prof22, Holger Poppert, MD, PhD23, Moise Desvarieux, MD, PhD, Assoc. Prof.24, M. Arfan Ikram, MD, PhD, Assoc. Prof.25-27, Sebastjan Bevc, MD, PhD, Assist Prof28, Daniel Staub, MD, Prof.29, Cesare R. 
Sirtori, MD, PhD, Prof.30, Bernhard Iglseder, MD, Prof31,32, Gunnar Engström, MD, PhD, Prof. 33, Giovanni Tripepi, MSc34, Oscar Beloqui, MD, PhD35, Moo-Sik Lee, MD., PhD., Prof.36,37, Alfonsa Friera, MD38, Wuxiang Xie, MD, PhD, Assist. Prof.39, Liliana Grigore, MD40, Matthieu Plichart, MD, PhD41, Ta-Chen Su, MD, PhD, Assoc. Prof.4, Christine Robertson, MBChB5, Caroline Schmidt, PhD, Assoc. Prof.42, Tomi-Pekka Tuomainen, MD, PhD, Prof7, Fabrizio Veglia, PhD8, Henry Völzke, MD, Prof43,44, Giel Nijpels, MD, PhD45,46, Aleksandar Jovanovic, MD, PhD, Prof47, Johann Willeit, MD, Prof.18, Ralph L. Sacco, MD, MS, Prof.48, Oscar H. Franco, MD, PhD, FESC, FFPH, Prof. 49, Radovan Hojs, MD, PhD, Prof28,50, Heiko Uthoff, MD29, Bo Hedblad, MD, PhD, Prof33, Hyun Woong Park, M.D.36, Carmen Suarez, MD, PhD51, Dong Zhao, MD, PhD, Prof.39, Alberico Catapano, PhD, Prof.52,53, Pierre Ducimetiere, Prof.54, Kuo-Liong Chien, MD, Prof55, Jackie F. Price, MD5, Göran Bergström, MD, PhD, Prof56, Jussi Kauhanen, MD, Prof7, Elena Tremoli, PhD, Prof8,57, Marcus Dörr, MD, Prof.58, Gerald Berenson, MD, Prof.59, Aikaterini Papagianni, MD, Assoc. Prof.13, Anna Kablak-Ziembicka, MD, PhD, Prof.14, Kazuo Kitagawa, MD, PhD60, Jaqueline.M. Dekker, Prof61, Radojica Stolic, MD, PhD, Prof17, Stefan Kiechl, MD, Prof18, Joseph F. Polak, MD, MPH, Prof62, Matthias Sitzer, MD, Prof.63, Horst Bickel, PhD64, Tatjana Rundek, MD, PhD, Prof.48, Albert Hofman, MD, PhD, Prof.25, Robert Ekart, MD, PhD, Assist. Prof65, Beat Frauchiger, MD, Prof.66, Samuela Castelnuovo, PhD67, Maria Rosvall, MD, PhD, Assoc. Prof.68, Carmine Zoccali, MD, Prof.34, Manuel F Landecho, MD, PhD35, Jang-Ho Bae, MD.,PhD.,FACC.36,69, Rafael Gabriel, Prof., MD, Phd70, Jing Liu, MD, PhD, Prof.39, Damiano Baldassarre, PhD, Prof8, Maryam Kavousi, MD, PhD71. Funding Information: The PROG-IMT project was funded by the Deutsche Forschungsgemeinschaft (DFG Lo 1569/2-1 and DFG Lo 1569/2-3). Publisher Copyright: © 2017 The Author(s).
PY - 2017/4/13
Y1 - 2017/4/13
KW - Data management
KW - Epidemiology
KW - Logic regression
KW - Meta-analysis
UR - http://www.scopus.com/inward/record.url?scp=85018523489&partnerID=8YFLogxK
U2 - 10.1186/s12911-017-0429-1
DO - 10.1186/s12911-017-0429-1
M3 - Article
C2 - 28407816
AN - SCOPUS:85018523489
SN - 1472-6947
VL - 17
JO - BMC medical informatics and decision making [E]
JF - BMC medical informatics and decision making [E]
IS - 1
M1 - 40
ER -