Sociodemographic differences in linkage error: An examination of four large-scale datasets

© 2018 The Author(s). Background: Record linkage is an important tool for epidemiologists and health planners. Record linkage studies will generally contain some level of residual record linkage error, where individual records are either incorrectly marked as belonging to the same individual, or inc...

Bibliographic Details
Main Authors: Randall, Sean, Brown, Adrian, Boyd, James, Schnell, R., Borgs, C., Ferrante, Anna
Format: Journal Article
Published: BioMed Central 2018
Online Access: http://hdl.handle.net/20.500.11937/72626
_version_ 1848762800530259968
author Randall, Sean
Brown, Adrian
Boyd, James
Schnell, R.
Borgs, C.
Ferrante, Anna
building Curtin Institutional Repository
collection Online Access
description © 2018 The Author(s). Background: Record linkage is an important tool for epidemiologists and health planners. Record linkage studies will generally contain some level of residual record linkage error, where individual records are either incorrectly marked as belonging to the same individual or incorrectly marked as belonging to separate individuals. A key question is whether linkage errors are distributed evenly throughout the population, or whether certain subgroups exhibit higher rates of error. Previous investigations of this issue have typically compared linked and un-linked records, which can conflate bias caused by record linkage error with bias caused by missing records (data capture errors). Methods: Four large administrative datasets were individually de-duplicated, with results compared to an available 'gold-standard' benchmark, allowing us to avoid the methodological issues of comparing linked and un-linked records. Results were compared by gender, age, geographic remoteness (major cities, regional or remote) and socioeconomic status. Results: Findings varied between datasets and by sociodemographic characteristic. The most consistent findings were worse linkage quality for younger individuals (seen in all four datasets) and worse linkage quality for those living in remote areas (seen in three of four datasets). Linkage quality within sociodemographic categories varied between datasets, with the direction of the association with linkage error reversing between datasets owing to quirks of the specific data collection mechanisms and data sharing practices. Conclusions: These results suggest caution should be taken both when linking younger individuals and those in remote areas, and when analysing linked data from these subgroups. Further research is required to determine the ramifications of worse linkage quality in these subpopulations on research outcomes.
first_indexed 2025-11-14T10:53:19Z
format Journal Article
id curtin-20.500.11937-72626
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T10:53:19Z
publishDate 2018
publisher BioMed Central
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-72626 2021-01-05T08:07:07Z 2018 Journal Article http://hdl.handle.net/20.500.11937/72626 10.1186/s12913-018-3495-x http://creativecommons.org/licenses/by/4.0/ BioMed Central fulltext
title Sociodemographic differences in linkage error: An examination of four large-scale datasets
url http://hdl.handle.net/20.500.11937/72626