Attribute-Based Semantic Type Detection and Data Quality Assessment

The increasing reliance on data-driven decision-making highlights the critical need for high-quality data. Despite advancements, data quality issues continue to impact both business strategies and scientific research. Current methods often fail to utilize the semantic richness embedded in attribute...

Bibliographic Details
Main Authors: Silva, Marcelo Valentim, Herrmann, Hannes, Maxville, Valerie
Format: Conference Paper
Published: IEEE 2024
Online Access: http://hdl.handle.net/20.500.11937/97536
_version_ 1848766294209331200
author Silva, Marcelo Valentim
Herrmann, Hannes
Maxville, Valerie
building Curtin Institutional Repository
collection Online Access
description The increasing reliance on data-driven decision-making highlights the critical need for high-quality data. Despite advancements, data quality issues continue to impact both business strategies and scientific research. Current methods often fail to utilize the semantic richness embedded in attribute labels (or column names/headers in tables), leading to a crucial gap in comprehensive data quality evaluation. This research addresses this gap by introducing an innovative methodology focused on Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information in attribute labels, combined with rule-based analysis and comprehensive dictionaries, our approach effectively addresses four key Big Data challenges: variety, veracity, volume and value. Our method provides a practical classification system of 23 semantic types, including numerical non-negative, categorical, ID, names, strings, geographical, temporal, and complex formats like URLs, IP addresses, email, and binary values plus several numerical bounded types, such as age and percentage (variety). The approach was validated across fifty diverse datasets from the UCI Machine Learning Repository, covering multiple domains, further highlighting its adaptability (variety). We also compared our types with the ones from Sherlock, a renowned method for Semantic Type Detection. Our evaluation showcases our method's proficiency in identifying data quality issues, detecting 81 missing values out of 922 attributes, compared to only one detected by YData Profiling (veracity). One dataset, containing over 2 million records, was processed efficiently, demonstrating the scalability of our approach (volume). These results underscore the enhanced capabilities of our method in streamlining data cleaning processes, ultimately improving the efficiency and effectiveness of data-driven decision-making across various domains (value).
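The description above mentions combining attribute labels, dictionaries, and rule-based analysis. As a rough illustration only, and not the authors' implementation, the idea might be sketched as follows. The keyword dictionary, the rule set, and the function names (`detect_type`, `find_suspect_values`) are all hypothetical; only the semantic type names echo the abstract.

```python
import re

# Hypothetical dictionary: label token -> semantic type.
# The keywords are illustrative assumptions, not taken from the paper.
LABEL_KEYWORDS = {
    "age": "age",
    "percent": "percentage",
    "pct": "percentage",
    "email": "email",
    "url": "url",
    "id": "id",
    "name": "name",
    "date": "temporal",
    "year": "temporal",
    "country": "geographical",
    "city": "geographical",
}


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Hypothetical validity rules for a few bounded or formatted types.
RULES = {
    "age": lambda v: v.isdigit() and 0 <= int(v) <= 120,
    "percentage": lambda v: is_number(v) and 0 <= float(v) <= 100,
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}


def tokenize_label(label):
    """Split a column label on separators and camelCase boundaries."""
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", label)
    return [t.lower() for t in tokens]


def detect_type(label):
    """Guess a semantic type from the label alone; default to 'string'."""
    for token in tokenize_label(label):
        if token in LABEL_KEYWORDS:
            return LABEL_KEYWORDS[token]
    return "string"


def find_suspect_values(label, values):
    """Return values that break the rule for the label's detected type."""
    rule = RULES.get(detect_type(label))
    if rule is None:
        return []
    return [v for v in values if not rule(str(v).strip())]


if __name__ == "__main__":
    print(detect_type("PatientAge"))
    print(find_suspect_values("age", ["34", "-1", "?", "61"]))
```

In this sketch, a placeholder such as "?" in an age column is flagged because it violates the bounded-numeric rule implied by the label, which is one way a label-driven check could surface disguised missing values.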
first_indexed 2025-11-14T11:48:51Z
format Conference Paper
id curtin-20.500.11937-97536
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T11:48:51Z
publishDate 2024
publisher IEEE
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-97536 2025-05-09T06:34:25Z Attribute-Based Semantic Type Detection and Data Quality Assessment Silva, Marcelo Valentim Herrmann, Hannes Maxville, Valerie 2024 Conference Paper http://hdl.handle.net/20.500.11937/97536 10.1109/BDCAT63179.2024.00030 IEEE fulltext
title Attribute-Based Semantic Type Detection and Data Quality Assessment
url http://hdl.handle.net/20.500.11937/97536