Attribute-Based Semantic Type Detection and Data Quality Assessment
The increasing reliance on data-driven decision-making highlights the critical need for high-quality data. Despite advancements, data quality issues continue to impact both business strategies and scientific research. Current methods often fail to utilize the semantic richness embedded in attribute...
| Main Authors: | Silva, Marcelo Valentim; Herrmann, Hannes; Maxville, Valerie |
|---|---|
| Format: | Conference Paper |
| Published: | IEEE 2024 |
| Online Access: | http://hdl.handle.net/20.500.11937/97536 |
| _version_ | 1848766294209331200 |
|---|---|
| author | Silva, Marcelo Valentim; Herrmann, Hannes; Maxville, Valerie |
| author_facet | Silva, Marcelo Valentim; Herrmann, Hannes; Maxville, Valerie |
| author_sort | Silva, Marcelo Valentim |
| building | Curtin Institutional Repository |
| collection | Online Access |
| description | The increasing reliance on data-driven decision-making highlights the critical need for high-quality data. Despite advancements, data quality issues continue to impact both business strategies and scientific research. Current methods often fail to utilize the semantic richness embedded in attribute labels (or column names/headers in tables), leading to a crucial gap in comprehensive data quality evaluation. This research addresses this gap by introducing an innovative methodology focused on Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information in attribute labels, combined with rule-based analysis and comprehensive dictionaries, our approach effectively addresses four key Big Data challenges: variety, veracity, volume and value. Our method provides a practical classification system of 23 semantic types, including numerical non-negative, categorical, ID, names, strings, geographical, temporal, and complex formats like URLs, IP addresses, email, and binary values plus several numerical bounded types, such as age and percentage (variety). The approach was validated across fifty diverse datasets from the UCI Machine Learning Repository, covering multiple domains, further highlighting its adaptability (variety). We also compared our types with the ones from Sherlock, a renowned method for Semantic Type Detection. Our evaluation showcases our method's proficiency in identifying data quality issues, detecting 81 missing values out of 922 attributes, compared to only one detected by YData Profiling (veracity). One dataset, containing over 2 million records, was processed efficiently, demonstrating the scalability of our approach (volume). These results underscore the enhanced capabilities of our method in streamlining data cleaning processes, ultimately improving the efficiency and effectiveness of data-driven decision-making across various domains (value). |
| first_indexed | 2025-11-14T11:48:51Z |
| format | Conference Paper |
| id | curtin-20.500.11937-97536 |
| institution | Curtin University Malaysia |
| institution_category | Local University |
| last_indexed | 2025-11-14T11:48:51Z |
| publishDate | 2024 |
| publisher | IEEE |
| recordtype | eprints |
| repository_type | Digital Repository |
| spelling | curtin-20.500.11937-97536 2025-05-09T06:34:25Z Attribute-Based Semantic Type Detection and Data Quality Assessment Silva, Marcelo Valentim; Herrmann, Hannes; Maxville, Valerie The increasing reliance on data-driven decision-making highlights the critical need for high-quality data. Despite advancements, data quality issues continue to impact both business strategies and scientific research. Current methods often fail to utilize the semantic richness embedded in attribute labels (or column names/headers in tables), leading to a crucial gap in comprehensive data quality evaluation. This research addresses this gap by introducing an innovative methodology focused on Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information in attribute labels, combined with rule-based analysis and comprehensive dictionaries, our approach effectively addresses four key Big Data challenges: variety, veracity, volume and value. Our method provides a practical classification system of 23 semantic types, including numerical non-negative, categorical, ID, names, strings, geographical, temporal, and complex formats like URLs, IP addresses, email, and binary values plus several numerical bounded types, such as age and percentage (variety). The approach was validated across fifty diverse datasets from the UCI Machine Learning Repository, covering multiple domains, further highlighting its adaptability (variety). We also compared our types with the ones from Sherlock, a renowned method for Semantic Type Detection. Our evaluation showcases our method's proficiency in identifying data quality issues, detecting 81 missing values out of 922 attributes, compared to only one detected by YData Profiling (veracity). One dataset, containing over 2 million records, was processed efficiently, demonstrating the scalability of our approach (volume). These results underscore the enhanced capabilities of our method in streamlining data cleaning processes, ultimately improving the efficiency and effectiveness of data-driven decision-making across various domains (value). 2024 Conference Paper http://hdl.handle.net/20.500.11937/97536 10.1109/BDCAT63179.2024.00030 IEEE fulltext |
| spellingShingle | Silva, Marcelo Valentim; Herrmann, Hannes; Maxville, Valerie Attribute-Based Semantic Type Detection and Data Quality Assessment |
| title | Attribute-Based Semantic Type Detection and Data Quality Assessment |
| title_full | Attribute-Based Semantic Type Detection and Data Quality Assessment |
| title_fullStr | Attribute-Based Semantic Type Detection and Data Quality Assessment |
| title_full_unstemmed | Attribute-Based Semantic Type Detection and Data Quality Assessment |
| title_short | Attribute-Based Semantic Type Detection and Data Quality Assessment |
| title_sort | attribute-based semantic type detection and data quality assessment |
| url | http://hdl.handle.net/20.500.11937/97536 |
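
The description above outlines an approach that reads semantic information from attribute labels and combines it with rule-based analysis and dictionaries. The following is only a minimal sketch of that general idea, not the authors' implementation: the keyword dictionary, the validity bounds, the list of missing-value placeholders, and the function names `detect_type` and `assess_column` are all hypothetical choices made for illustration, and it covers five example types rather than the 23 the paper reports.

```python
import re


def _is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# Hypothetical keyword dictionary mapping label tokens to a semantic type.
# These entries are illustrative assumptions, not the paper's dictionaries.
LABEL_KEYWORDS = {
    "age": "age",
    "percent": "percentage",
    "pct": "percentage",
    "email": "email",
    "url": "url",
    "id": "id",
}

# Hypothetical per-type validity rules (bounds and patterns are assumed).
RULES = {
    "age": lambda v: _is_number(v) and 0 <= float(v) <= 120,
    "percentage": lambda v: _is_number(v) and 0.0 <= float(v) <= 100.0,
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "url": lambda v: v.startswith(("http://", "https://")),
}

# Placeholder strings treated as disguised missing values (assumed list).
MISSING_TOKENS = {"", "?", "na", "n/a", "null", "none", "-"}


def detect_type(label):
    """Guess a semantic type from the attribute label alone."""
    # Split camelCase, then keep alphabetic tokens (drops '_', digits, etc.).
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", label)
    for token in re.findall(r"[A-Za-z]+", spaced):
        semantic_type = LABEL_KEYWORDS.get(token.lower())
        if semantic_type:
            return semantic_type
    return "string"  # fallback when no keyword matches


def assess_column(label, values):
    """Count missing and rule-violating values for one column."""
    semantic_type = detect_type(label)
    is_valid = RULES.get(semantic_type, lambda v: True)
    missing = invalid = 0
    for raw in values:
        value = str(raw).strip()
        if value.lower() in MISSING_TOKENS:
            missing += 1
        elif not is_valid(value):
            invalid += 1
    return {"type": semantic_type, "missing": missing, "invalid": invalid}


report = assess_column("patient_age", ["34", "?", "210", "57", ""])
print(report)
```

In this sketch the label `patient_age` selects the assumed age rule, so `"?"` and the empty string are counted as missing while `210` is counted as out of range. A profiler that looks only at the values, without the label, would have no basis for applying an age bound, which is the kind of gap the description refers to.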