The effect of data cleaning on record linkage quality

Background: Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has bee...

Bibliographic Details
Main Authors: Randall, Sean, Ferrante, Anna, Boyd, James, Semmens, James
Format: Journal Article
Published: Biomed Central Ltd 2013
Subjects: Data cleaning, Medical record linkage, Data quality
Online Access: http://www.biomedcentral.com/1472-6947/13/64
http://hdl.handle.net/20.500.11937/17174
_version_ 1848749390630486016
author Randall, Sean
Ferrante, Anna
Boyd, James
Semmens, James
author_facet Randall, Sean
Ferrante, Anna
Boyd, James
Semmens, James
author_sort Randall, Sean
building Curtin Institutional Repository
collection Online Access
description Background: Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality. Methods: A range of cleaning techniques was applied to both a synthetically generated dataset and a large administrative dataset previously linked to a high standard. The effect of these changes on linkage quality was investigated using pairwise F-measure to determine quality. Results: Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that decreases in linkage quality were due to cleaning techniques typically reducing the variability: although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall. Conclusions: Data cleaning techniques have minimal effect on linkage quality. Care should be taken during the data cleaning process.
first_indexed 2025-11-14T07:20:11Z
format Journal Article
id curtin-20.500.11937-17174
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T07:20:11Z
publishDate 2013
publisher Biomed Central Ltd
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-17174 2017-05-30T08:12:17Z The effect of data cleaning on record linkage quality Randall, Sean Ferrante, Anna Boyd, James Semmens, James Data cleaning Medical record linkage Data quality Background: Within the field of record linkage, numerous data cleaning and standardisation techniques are employed to ensure the highest quality of links. While these facilities are common in record linkage software packages and are regularly deployed across record linkage units, little work has been published demonstrating the impact of data cleaning on linkage quality. Methods: A range of cleaning techniques was applied to both a synthetically generated dataset and a large administrative dataset previously linked to a high standard. The effect of these changes on linkage quality was investigated using pairwise F-measure to determine quality. Results: Data cleaning made little difference to the overall linkage quality, with heavy cleaning leading to a decrease in quality. Further examination showed that decreases in linkage quality were due to cleaning techniques typically reducing the variability: although correct records were now more likely to match, incorrect records were also more likely to match, and these incorrect matches outweighed the correct matches, reducing quality overall. Conclusions: Data cleaning techniques have minimal effect on linkage quality. Care should be taken during the data cleaning process. 2013 Journal Article http://hdl.handle.net/20.500.11937/17174 http://www.biomedcentral.com/1472-6947/13/64 Biomed Central Ltd fulltext
spellingShingle Data cleaning
Medical record linkage
Data quality
Randall, Sean
Ferrante, Anna
Boyd, James
Semmens, James
The effect of data cleaning on record linkage quality
title The effect of data cleaning on record linkage quality
title_full The effect of data cleaning on record linkage quality
title_fullStr The effect of data cleaning on record linkage quality
title_full_unstemmed The effect of data cleaning on record linkage quality
title_short The effect of data cleaning on record linkage quality
title_sort effect of data cleaning on record linkage quality
topic Data cleaning
Medical record linkage
Data quality
url http://www.biomedcentral.com/1472-6947/13/64
http://hdl.handle.net/20.500.11937/17174
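
The description above names pairwise F-measure as the quality metric but does not spell out how it is computed. The following is a minimal sketch of the commonly used definition (the harmonic mean of pairwise precision and pairwise recall), not the authors' own code. The function names, the grouping representation, and the record identifiers are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def linked_pairs(groups):
    """Return the set of unordered record pairs implied by a grouping.

    Each group is an iterable of record identifiers that are considered
    to belong to the same person (hypothetical representation).
    """
    pairs = set()
    for records in groups:
        # Sorting gives each pair a canonical order, so (a, b) == (b, a).
        for a, b in combinations(sorted(records), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f_measure(predicted_groups, true_groups):
    """Compute pairwise precision, recall and F-measure for a linkage."""
    predicted = linked_pairs(predicted_groups)
    truth = linked_pairs(true_groups)
    true_positives = len(predicted & truth)

    # Guard against empty pair sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative data only: five made-up record IDs.
true_groups = [{1, 2, 3}, {4, 5}]
predicted_groups = [{1, 2}, {3, 4, 5}]
precision, recall, f_measure = pairwise_f_measure(predicted_groups, true_groups)
print(precision, recall, f_measure)
```

In this toy case the predicted linkage finds two of the four true pairs and also asserts two incorrect pairs, so precision and recall both fall. This is the trade-off the Results describe: cleaning that makes more records agree can add incorrect matches faster than correct ones.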