Automatic identification of variables in epidemiological datasets using logic regression

Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated ide...

Bibliographic Details
Main Authors: Lorenz, M., Abdi, N., Scheckenbach, F., Pflug, A., Bülbül, A., Catapano, A., Agewall, S., Ezhov, M., Bots, M., Kiechl, S., Orth, A., Norata, Giuseppe, Empana, J., Lin, H., McLachlan, S., Bokemark, L., Ronkainen, K., Amato, M., Schminke, U., Srinivasan, S., Lind, L., Kato, A., Dimitriadis, C., Przewlocki, T., Okazaki, S., Stehouwer, C., Lazarevic, T., Willeit, P., Yanez, D., Steinmetz, H., Sander, D., Poppert, H., Desvarieux, M., Ikram, M., Bevc, S., Staub, D., Sirtori, C., Iglseder, B., Engström, G., Tripepi, G., Beloqui, O., Lee, M., Friera, A., Xie, W., Grigore, L., Plichart, M., Su, T., Robertson, C., Schmidt, C., Tuomainen, T., Veglia, F., Völzke, H., Nijpels, G., Jovanovic, A., Willeit, J., Sacco, R., Franco, O., Hojs, R., Uthoff, H., Hedblad, B., Park, H., Suarez, C., Zhao, D., Ducimetiere, P., Chien, K., Price, J., Bergström, G., Kauhanen, J., Tremoli, E., Dörr, M., Berenson, G., Papagianni, A., Kablak-Ziembicka, A., Kitagawa, K., Dekker, J., Stolic, R., Polak, J., Sitzer, M., Bickel, H., Rundek, T., Hofman, A., Ekart, R., Frauchiger, B., Castelnuovo, S., Rosvall, M., Zoccali, C., Landecho, M., Bae, J., Gabriel, R., Liu, J., Baldassarre, D., Kavousi, M.
Format: Journal Article
Published: Biomed Central Ltd 2017
Online Access: http://hdl.handle.net/20.500.11937/55818
author Lorenz, M.
Abdi, N.
Scheckenbach, F.
Pflug, A.
Bülbül, A.
Catapano, A.
Agewall, S.
Ezhov, M.
Bots, M.
Kiechl, S.
Orth, A.
Norata, Giuseppe
Empana, J.
Lin, H.
McLachlan, S.
Bokemark, L.
Ronkainen, K.
Amato, M.
Schminke, U.
Srinivasan, S.
Lind, L.
Kato, A.
Dimitriadis, C.
Przewlocki, T.
Okazaki, S.
Stehouwer, C.
Lazarevic, T.
Willeit, P.
Yanez, D.
Steinmetz, H.
Sander, D.
Poppert, H.
Desvarieux, M.
Ikram, M.
Bevc, S.
Staub, D.
Sirtori, C.
Iglseder, B.
Engström, G.
Tripepi, G.
Beloqui, O.
Lee, M.
Friera, A.
Xie, W.
Grigore, L.
Plichart, M.
Su, T.
Robertson, C.
Schmidt, C.
Tuomainen, T.
Veglia, F.
Völzke, H.
Nijpels, G.
Jovanovic, A.
Willeit, J.
Sacco, R.
Franco, O.
Hojs, R.
Uthoff, H.
Hedblad, B.
Park, H.
Suarez, C.
Zhao, D.
Catapano, A.
Ducimetiere, P.
Chien, K.
Price, J.
Bergström, G.
Kauhanen, J.
Tremoli, E.
Dörr, M.
Berenson, G.
Papagianni, A.
Kablak-Ziembicka, A.
Kitagawa, K.
Dekker, J.
Stolic, R.
Polak, J.
Sitzer, M.
Bickel, H.
Rundek, T.
Hofman, A.
Ekart, R.
Frauchiger, B.
Castelnuovo, S.
Rosvall, M.
Zoccali, C.
Landecho, M.
Bae, J.
Gabriel, R.
Liu, J.
Baldassarre, D.
Kavousi, M.
author_sort Lorenz, M.
building Curtin Institutional Repository
collection Online Access
description Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows creating software that presents, for each target variable, a choice of source variables from which a user can choose the matching one, with only a low risk of having missed a correct source variable. Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated. Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables, in the validation sample in 71% of all variables. Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
first_indexed 2025-11-14T10:04:17Z
format Journal Article
id curtin-20.500.11937-55818
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T10:04:17Z
publishDate 2017
publisher Biomed Central Ltd
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-55818 2017-10-19T01:02:00Z 2017 Journal Article http://hdl.handle.net/20.500.11937/55818 10.1186/s12911-017-0429-1 http://creativecommons.org/licenses/by/4.0/ Biomed Central Ltd fulltext
title Automatic identification of variables in epidemiological datasets using logic regression
url http://hdl.handle.net/20.500.11937/55818
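
The Methods part of the description (hand-written rules per target variable, a Boolean combination of those rules, evaluation by PPV and NPV) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the rule names, the toy source variables, and the target variable "systolic blood pressure" are hypothetical, and the exhaustive search over two-rule AND/OR combinations is a simple stand-in for the search that logic regression software performs over larger logic trees.

```python
from itertools import combinations
from typing import Callable, Dict, Iterator, List, Tuple

Variable = Dict[str, str]            # metadata of one source variable
Rule = Callable[[Variable], bool]    # a simple hand-written rule
Literal = Tuple[str, bool]           # (rule name, negated?)
Expr = Tuple[str, Literal, Literal]  # ("and" | "or", literal, literal)

# Hypothetical rules for the target variable "systolic blood pressure".
RULES: Dict[str, Rule] = {
    "name_has_sbp": lambda v: "sbp" in v["name"].lower(),
    "label_has_systolic": lambda v: "systolic" in v["label"].lower(),
    "label_has_diastolic": lambda v: "diastolic" in v["label"].lower(),
}

# Toy construction subset: (source variable, is it the target variable?).
DATA: List[Tuple[Variable, bool]] = [
    ({"name": "SBP1", "label": "Systolic blood pressure, first reading"}, True),
    ({"name": "bp_sys", "label": "Systolic BP"}, True),
    ({"name": "DBP1", "label": "Diastolic blood pressure"}, False),
    ({"name": "sbp_dbp_ratio", "label": "Systolic/diastolic ratio"}, False),
    ({"name": "age", "label": "Age at baseline"}, False),
]


def eval_literal(lit: Literal, var: Variable) -> bool:
    """Apply one rule to a variable, flipping the result if negated."""
    name, negated = lit
    return RULES[name](var) != negated


def evaluate(expr: Expr, var: Variable) -> bool:
    """Evaluate a two-literal Boolean expression on one variable."""
    op, a, b = expr
    x, y = eval_literal(a, var), eval_literal(b, var)
    return (x and y) if op == "and" else (x or y)


def candidates() -> Iterator[Expr]:
    """Yield every AND/OR combination of two literals from different rules."""
    literals = [(name, neg) for name in RULES for neg in (False, True)]
    for a, b in combinations(literals, 2):
        if a[0] != b[0]:
            for op in ("and", "or"):
                yield (op, a, b)


def best_expression(data: List[Tuple[Variable, bool]]) -> Expr:
    """Pick the candidate with the fewest misclassifications on the data."""
    def errors(expr: Expr) -> int:
        return sum(evaluate(expr, var) != truth for var, truth in data)
    return min(candidates(), key=errors)


def ppv_npv(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Return (PPV, NPV); a value is NaN when its denominator is zero."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    fn = sum(not p and a for p, a in pairs)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv


best = best_expression(DATA)
predictions = [evaluate(best, var) for var, _ in DATA]
ppv, npv = ppv_npv(predictions, [truth for _, truth in DATA])
print(best, ppv, npv)
```

In a setting like the one described, the chosen expression would be fitted on the construction subset and then scored with `ppv_npv` on a separate validation subset; the toy data here is far too small to say anything about real performance.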