A simple sampling method for estimating the accuracy of large scale record linkage projects

Background: Record linkage techniques allow different data collections to be brought together to provide a wider picture of the health status of individuals. Ensuring high linkage quality is important to guarantee the quality and integrity of research. Current methods for measuring linkage quality t...

Bibliographic Details
Main Authors:	Boyd, James; Guiver, T.; Randall, Sean; Ferrante, Anna; Semmens, James; Anderson, P.; Dickinson, T.
Format:	Journal Article
Published:	Schattauer Publishers 2016
Online Access:	http://hdl.handle.net/20.500.11937/26908

_version_	1848752118352052224
author	Boyd, James; Guiver, T.; Randall, Sean; Ferrante, Anna; Semmens, James; Anderson, P.; Dickinson, T.
author_facet	Boyd, James; Guiver, T.; Randall, Sean; Ferrante, Anna; Semmens, James; Anderson, P.; Dickinson, T.
author_sort	Boyd, James
building	Curtin Institutional Repository
collection	Online Access
description	Background: Record linkage techniques allow different data collections to be brought together to provide a wider picture of the health status of individuals. Ensuring high linkage quality is important to guarantee the quality and integrity of research. Current methods for measuring linkage quality typically focus on precision (the proportion of links that are correct), given the difficulty of measuring the proportion of false negatives. Objectives: The aim of this work is to introduce and evaluate a sampling-based method to estimate both precision and recall following record linkage. Methods: In the sampling-based method, record pairs from each threshold (including those below the identified cut-off for acceptance) are sampled and clerically reviewed. These results are then applied to the entire set of record pairs, providing estimates of false positives and false negatives. This method was evaluated on a synthetically generated dataset, where the true match status (which records belonged to the same person) was known. Results: The sampled estimates of linkage quality were relatively close to actual linkage quality metrics calculated for the whole synthetic dataset. The precision and recall measures for seven reviewers were very consistent, with little variation in the clerical assessment results (overall agreement using the Fleiss' kappa statistic was 0.601). Conclusions: This method presents as a possible means of accurately estimating matching quality and refining linkages in population-level linkage studies. The sampling approach is especially important for large project linkages, where the number of record pairs produced may be very large, often running into millions.
first_indexed	2025-11-14T08:03:32Z
format	Journal Article
id	curtin-20.500.11937-26908
institution	Curtin University Malaysia
institution_category	Local University
last_indexed	2025-11-14T08:03:32Z
publishDate	2016
publisher	Schattauer Publishers
recordtype	eprints
repository_type	Digital Repository
spelling	curtin-20.500.11937-26908 2019-02-19T05:35:40Z A simple sampling method for estimating the accuracy of large scale record linkage projects Boyd, James Guiver, T. Randall, Sean Ferrante, Anna Semmens, James Anderson, P. Dickinson, T. 2016 Journal Article http://hdl.handle.net/20.500.11937/26908 10.3414/ME15-01-0152 Schattauer Publishers fulltext
spellingShingle	Boyd, James Guiver, T. Randall, Sean Ferrante, Anna Semmens, James Anderson, P. Dickinson, T. A simple sampling method for estimating the accuracy of large scale record linkage projects
title	A simple sampling method for estimating the accuracy of large scale record linkage projects
title_full	A simple sampling method for estimating the accuracy of large scale record linkage projects
title_fullStr	A simple sampling method for estimating the accuracy of large scale record linkage projects
title_full_unstemmed	A simple sampling method for estimating the accuracy of large scale record linkage projects
title_short	A simple sampling method for estimating the accuracy of large scale record linkage projects
title_sort	simple sampling method for estimating the accuracy of large scale record linkage projects
url	http://hdl.handle.net/20.500.11937/26908
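
Illustrative sketch (not part of the catalogue record or the article): the description field outlines a procedure in which record pairs are sampled at each threshold, clerically reviewed, and the reviewed match rates are scaled up to the full set of pairs to estimate false positives and false negatives. The Python below is one minimal reading of that idea. The `Stratum` class, the `estimate_quality` function, the score bands and all counts are hypothetical and are not taken from the paper. It also assumes that every true match appears somewhere among the scored pairs, so matches never compared at all are not counted.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """One score band of record pairs (all field names are hypothetical)."""
    score: int            # match score or threshold band of the pairs
    total_pairs: int      # number of record pairs produced in this band
    sampled: int          # number of pairs sent for clerical review
    sampled_matches: int  # reviewed pairs judged to be true matches


def estimate_quality(strata, cutoff):
    """Scale sampled review results up to each band, then pool them.

    Bands at or above `cutoff` are treated as accepted links; bands below
    it are treated as rejected pairs that may hide missed matches.
    Returns (estimated precision, estimated recall).
    """
    tp = fp = fn = 0.0
    for s in strata:
        if s.sampled <= 0:
            raise ValueError(f"band {s.score} has no reviewed pairs")
        # Estimated true matches in the whole band from the sample rate.
        est_matches = s.total_pairs * s.sampled_matches / s.sampled
        if s.score >= cutoff:
            tp += est_matches
            fp += s.total_pairs - est_matches
        else:
            fn += est_matches
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Made-up counts, for illustration only.
EXAMPLE_STRATA = [
    Stratum(score=20, total_pairs=1000, sampled=100, sampled_matches=100),
    Stratum(score=15, total_pairs=200, sampled=100, sampled_matches=90),
    Stratum(score=10, total_pairs=400, sampled=100, sampled_matches=10),
]

if __name__ == "__main__":
    p, r = estimate_quality(EXAMPLE_STRATA, cutoff=15)
    print(f"estimated precision: {p:.3f}, estimated recall: {r:.3f}")
```

Sampling within each band, rather than across all pairs at once, keeps the review workload small in bands that are almost entirely matches or non-matches, while still yielding an estimate for the pairs below the cut-off.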

Similar Items