2 research outputs found

Automatic identification of variables in epidemiological datasets using logic regression

Author: Abdi N.A. (Negin Ashtiani)
Agewall Stefan
Amato M. (Mauro)
Bae J.-H. (Jang-Ho)
Baldassarre Damiano
Beloqui O. (Oscar)
Berenson G. (Gerald)
Bergstrom Goran
Bevc S. (Sebastjan)
Bickel Horst
Bokemark Lena
Bots Michiel
Bulbul Alpaslan
Castelnuovo S. (Samuela)
Catapano A. (Alberico)
Catapano Alberico
Chien K.-L. (Kuo-Liong)
Dekker Jacqueline
Desvarieux Moise
Dimitriadis C. (Chrystosomos)
Ducimetiere P.
Dörr Marcus
Ekart R. (Robert)
Empana Jean Philippe
Engström G.
Ezhov Marat
Franco Oscar
Frauchiger B. (Beat)
Friera Alfonsa
Gabriel R. (Rafael)
Grigore Liliana
Hedblad Bo
Hofman Albert
Hojs R. (Radovan)
Iglseder Bernhard
Ikram Arfan
Jovanovic A. (Aleksandar)
Kablak-Ziembicka A. (Anna)
Kato A. (Akihiko)
Kauhanen Jussi
Kavousi Maryam
Kiechl Stefan
Kitagawa K. (Kazuo)
Landecho M.F. (Manuel F.)
Lazarevic T. (Tatjana)
Lee M.-S. (Moo-Sik)
Lin H.-J. (Hung-Ju)
Lind Lars
Liu J. (Jing)
Lorenz Matthias W.
McLachlan Stela
Nijpels Giel
Norata Giuseppe
Okazaki S. (Shuhei)
Orth A. (Andreas)
Papagianni Aikaterini
Park H.W. (Hyun Woong)
Pflug Anja
Plichart Matthieu
Polak Joseph F.
Poppert Holger
Price J.F. (Jackie F.)
Przewlocki T. (Tadeusz)
Robertson Christine M
Ronkainen Kimmo
Rosvall Maria
Rundek Tatjana
Sacco R.L. (Ralph L.)
Sander D. (Dirk)
Scheckenbach Frank
Schmidt Caroline
Schminke Ulf
Sirtori C.R. (Cesare R.)
Sitzer Matthias
Srinivasan S.R. (Sathanur R.)
Staub D. (Daniel)
Stehouwer Coen
Steinmetz Helmuth
Stolic R. (Radojica)
Su T.-C. (Ta-Chen)
Suarez C. (Carmen)
Tremoli Elena
Tripepi Giovanni
Tuomainen T.-P. (Tomi-Pekka)
Uthoff H. (Heiko)
Veglia Fabrizio
Völzke Henry
Willeit Johann
Willeit Peter
Xie Wuxiang
Yanez D.N. (David N.)
Zhao D. (Dong)
Zoccali Carmine
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software that, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of having missed a correct source variable. Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated. Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34% and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables; in the validation sample, this was the case in 71% of all variables. Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
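
The workflow in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The rule names, the target variable (systolic blood pressure) and the labelled variable names are all hypothetical, and the exhaustive search over single rules, negations and pairwise AND/OR combinations is only a simple stand-in for the search that logic regression performs over Boolean expressions. The sketch fits a combination on a construction subset and reports PPV and NPV on a separate validation subset.

```python
from itertools import combinations

# Hypothetical hand-written rules for one target variable (systolic blood
# pressure). Each rule maps a source variable name to True or False.
RULES = {
    "has_sbp": lambda name: "sbp" in name.lower(),
    "has_sys": lambda name: "sys" in name.lower(),
    "has_bp": lambda name: "bp" in name.lower(),
    "has_dia": lambda name: "dia" in name.lower(),
}


def candidates(rule_names):
    """Yield (label, predicate) for rules, negated rules, and pairwise AND/OR."""
    literals = []
    for r in rule_names:
        literals.append((r, RULES[r]))
        literals.append((f"not {r}", lambda n, f=RULES[r]: not f(n)))
    yield from literals
    for (la, fa), (lb, fb) in combinations(literals, 2):
        yield f"({la}) and ({lb})", lambda n, fa=fa, fb=fb: fa(n) and fb(n)
        yield f"({la}) or ({lb})", lambda n, fa=fa, fb=fb: fa(n) or fb(n)


def confusion(pred, labelled):
    """Count true/false positives and negatives of pred on labelled names."""
    tp = fp = tn = fn = 0
    for name, is_match in labelled:
        predicted = pred(name)
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ppv_npv(pred, labelled):
    """PPV = TP / (TP + FP), NPV = TN / (TN + FN); NaN if undefined."""
    tp, fp, tn, fn = confusion(pred, labelled)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv


def fit(labelled):
    """Return the candidate with the fewest misclassifications (FP + FN)."""
    def errors(item):
        _, fp, _, fn = confusion(item[1], labelled)
        return fp + fn
    return min(candidates(list(RULES)), key=errors)


# Hypothetical labelled data: (source variable name, matches the target?)
construction = [("SBP_1", True), ("sys_bp", True), ("systolic", True),
                ("dia_bp", False), ("DBP", False), ("age", False)]
validation = [("SYSBP", True), ("bp_sys_mean", True), ("diasbp", False),
              ("sex", False), ("syst_date", False)]

label, best = fit(construction)
print("chosen combination:", label)
print("construction PPV, NPV:", ppv_npv(best, construction))
print("validation PPV, NPV:", ppv_npv(best, validation))
```

In this toy data the fitted combination misses no true match in the validation subset but also flags unrelated names. That is the situation the abstract describes for semi-automation: a high NPV with a low PPV still lets a user pick the correct variable from a short list of candidates.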

Directory of Open Access Journals

Edinburgh Research Explorer

University of Miami: Scholarship Miami

Erasmus University Digital Repository

NORA - Norwegian Open Research Archives

espace@Curtin

Crossref

AIR Universita degli studi di Milano

PubMed Central

EUR Research Repository

Utrecht University Repository

Leicester Research Archive

Automatic identification of variables in epidemiological datasets using logic regression

Author: Abdi Negin Ashtiani
Agewall Stefan
Amato Mauro
Bae Jang Ho
Baldassarre Damiano
Beloqui Oscar
Berenson Gerald
Bergström Göran
Bevc Sebastjan
Bickel Horst
Bokemark Lena
Bots Michiel L.
Bülbül Alpaslan
Castelnuovo Samuela
Catapano Alberico
Catapano Alberico L.
Chien Kuo Liong
Dekker Jaqueline M.
Desvarieux Moise
Dimitriadis Chrystosomos
Ducimetiere Pierre
Dörr Marcus
Ekart Robert
Empana Jean Philippe
Engström Gunnar
Ezhov Marat
Franco Oscar H.
Frauchiger Beat
Friera Alfonsa
Gabriel Rafael
Grigore Liliana
Hedblad Bo
Hofman Albert
Hojs Radovan
Iglseder Bernhard
Ikram M. Arfan
Jovanovic Aleksandar
Kablak-Ziembicka Anna
Kato Akihiko
Kauhanen Jussi
Kavousi Maryam
Kiechl Stefan
Kitagawa Kazuo
Landecho Manuel F.
Lazarevic Tatjana
Lee Moo Sik
Lin Hung Ju
Lind Lars
Liu Jing
Lorenz Matthias W.
McLachlan Stela
Nijpels Giel
Norata Giuseppe D.
Okazaki Shuhei
Orth Andreas
Papagianni Aikaterini
Park Hyun Woong
Pflug Anja
Plichart Matthieu
Polak Joseph F.
Poppert Holger
Price Jackie F.
Przewlocki Tadeusz
Robertson Christine
Ronkainen Kimmo
Rosvall Maria
Rundek Tatjana
Sacco Ralph L.
Sander Dirk
Scheckenbach Frank
Schmidt Caroline
Schminke Ulf
Sirtori Cesare R.
Sitzer Matthias
Srinivasan Sathanur R.
Staub Daniel
Stehouwer C. D.A.
Steinmetz Helmuth
Stolic Radojica
Su Ta Chen
Suarez Carmen
Tremoli Elena
Tripepi Giovanni
Tuomainen Tomi Pekka
Uthoff Heiko
Veglia Fabrizio
Völzke Henry
Willeit Johann
Willeit Peter
Xie Wuxiang
Yanez David N.
Zhao Dong
Zoccali Carmine
Publication date: 13/04/2017

Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software that, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of having missed a correct source variable. Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated. Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34% and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables; in the validation sample, this was the case in 71% of all variables. Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.

Utrecht University Repository