A tree-based measure for hierarchical data in mixed databases

The structure of the data in a mixed database can be a barrier when clustering that database into meaningful groups. A hierarchically structured database necessitates efficient distance measures and clustering algorithms to locate similarities between data objects. Therefore, existing literature pro...


Bibliographic Details
Main Author: Hassan, Diman
Format: Thesis (University of Nottingham only)
Language: English
Published: 2016
Subjects:
Online Access: https://eprints.nottingham.ac.uk/34652/
_version_ 1848794903168942080
author Hassan, Diman
building Nottingham Research Data Repository
collection Online Access
description The structure of the data in a mixed database can be a barrier to clustering that database into meaningful groups. A hierarchically structured database requires efficient distance measures and clustering algorithms to locate similarities between data objects. The existing literature therefore proposes hierarchical distance measures to quantify the similarity between records in hierarchical databases. The main contribution of this research is to create and test a new distance measure for large hierarchical databases consisting of mixed data types and attributes, based on an existing tree-based (hierarchical) distance metric, the pq-gram distance metric. Several aims and objectives were pursued to fill a number of gaps in the current body of knowledge. One of these goals was to verify the validity of the pq-gram distance metric when applied to different data sets, and to compare and combine it with a number of other distance measures to demonstrate its usefulness across large mixed databases. To achieve this, further work explored how to exploit the existing method as a measure of hierarchical data attributes in mixed data sets, and whether the new method would produce better results with large mixed databases. For evaluation purposes, the pq-gram metric was applied to The Health Improvement Network (THIN) database to determine whether it could identify similarities between the records in the database. After this, different distance measures, including non-hierarchical and other hierarchical measures, were examined on mixed data and combined to create a Combined Distance Function (CDF). The CDF improved the results when applied to different data sets, such as the hierarchical National Bureau of Economic Research of United States (NBER US) Patent data set and the mixed THIN data set.
The CDF was then modified to create a New-CDF, which used only the hierarchical pq-gram metric to measure the hierarchical attributes in the mixed data set. The New-CDF worked well: applied to the THIN data set, it found the most similar data records, which were grouped in one cluster by the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) clustering algorithm. The quality of the clusters was assessed using two internal validation indices, Silhouette and C-Index, whose values indicated good compactness and quality of the clusters obtained using the new method.
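For readers unfamiliar with the pq-gram distance named in the description, the following is a minimal Python sketch of the measure as it is commonly defined in the literature. It is not taken from the thesis. The tree representation (a `(label, children)` tuple), the function names `pq_profile` and `pq_gram_distance`, the defaults `p=2, q=3`, and the sample trees are all assumptions made for illustration.

```python
from collections import Counter

# Assumed representation: a tree is a (label, children) tuple,
# where children is a list of trees in left-to-right order.

def pq_profile(tree, p=2, q=3):
    """Return the bag (multiset) of pq-grams of an ordered, labelled tree."""
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        # Stem: the node plus its p-1 nearest ancestors, padded with "*".
        stem = (ancestors + (label,))[-p:]
        # Base: a sliding window over q consecutive children, padded with "*".
        base = ("*",) * q
        if not children:
            profile[stem + base] += 1
            return
        for child in children:
            base = base[1:] + (child[0],)
            profile[stem + base] += 1
            visit(child, stem)
        # Slide the window past the last child.
        for _ in range(q - 1):
            base = base[1:] + ("*",)
            profile[stem + base] += 1

    visit(tree, ("*",) * p)
    return profile


def pq_gram_distance(t1, t2, p=2, q=3):
    """1 - 2 * |P1 bag-intersect P2| / (|P1| + |P2|); 0 for identical trees."""
    a, b = pq_profile(t1, p, q), pq_profile(t2, p, q)
    shared = sum((a & b).values())  # bag intersection size
    total = sum(a.values()) + sum(b.values())  # bag union size
    return 1.0 - 2.0 * shared / total


# Hypothetical toy hierarchies that differ in one leaf label.
t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [])])
print(pq_gram_distance(t1, t1))
print(pq_gram_distance(t1, t2))
```

Because each profile is a bag of label tuples, the distance can be computed without aligning the two trees node by node. A combined function such as the CDF mentioned in the description would presumably mix a value like this with distances over the non-hierarchical attributes, but the record does not state how they are weighted.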
first_indexed 2025-11-14T19:23:35Z
format Thesis (University of Nottingham only)
id nottingham-34652
institution University of Nottingham Malaysia Campus
institution_category Local University
language English
last_indexed 2025-11-14T19:23:35Z
publishDate 2016
recordtype eprints
repository_type Digital Repository
spelling nottingham-34652 2025-02-28T13:31:02Z https://eprints.nottingham.ac.uk/34652/ A tree-based measure for hierarchical data in mixed databases Hassan, Diman 2016-10-15 Thesis (University of Nottingham only) NonPeerReviewed application/pdf en arr https://eprints.nottingham.ac.uk/34652/1/PhD%20Thesis-Diman%20Hassan.pdf Hassan, Diman (2016) A tree-based measure for hierarchical data in mixed databases. PhD thesis, University of Nottingham. databases database tree-based data
title A tree-based measure for hierarchical data in mixed databases
topic databases
database
tree-based
data
url https://eprints.nottingham.ac.uk/34652/