Comparison of distance metrics for hierarchical data in medical databases

Distance metrics are broadly used in different research areas and applications, such as bio-informatics, data mining and many other fields. However, some metrics, such as pq-gram and Edit Distance, are used specifically for data with a hierarchical structure. Other metrics used for non-hierarchica...

Bibliographic Details
Main Authors:	Hassan, Diman; Aickelin, Uwe; Wagner, Christian
Format:	Conference or Workshop Item
Published:	2014
Subjects:	Biomedical Informatics; Data Mining; Machine Learning
Online Access:	https://eprints.nottingham.ac.uk/3349/

author	Hassan, Diman; Aickelin, Uwe; Wagner, Christian
building	Nottingham Research Data Repository
collection	Online Access
description	Distance metrics are widely used across research areas and applications such as bio-informatics and data mining. Some metrics, such as pq-gram and Edit Distance, are used specifically for data with a hierarchical structure; others, such as the geometric and Hamming metrics, are used for non-hierarchical data. We have applied these metrics to The Health Improvement Network (THIN) database, which contains some hierarchical data. For the first group of metrics, the THIN data has to be converted into a tree-like structure. For the second group, the data are converted into a frequency table or matrix. For all metrics, all distances are then computed and normalised. Based on this particular data set, our research question is: which of these metrics is useful for THIN data? This paper compares the metrics, particularly the pq-gram metric, on finding similarities in patients' data. It also investigates similar patients who have the same close distances, as well as the metrics' suitability for clustering the whole patient population. Our results show that the two groups of metrics perform differently because they represent different structures of the data. Nevertheless, all the metrics could represent some similar patient data and discriminate sufficiently well when clustering the patient population with the k-means clustering algorithm.
first_indexed	2025-11-14T18:21:39Z
format	Conference or Workshop Item
id	nottingham-3349
institution	University of Nottingham Malaysia Campus
institution_category	Local University
last_indexed	2025-11-14T18:21:39Z
publishDate	2014
recordtype	eprints
repository_type	Digital Repository
spelling	nottingham-3349 2020-05-04T16:54:35Z https://eprints.nottingham.ac.uk/3349/ Comparison of distance metrics for hierarchical data in medical databases Hassan, Diman; Aickelin, Uwe; Wagner, Christian 2014-09-04 Conference or Workshop Item PeerReviewed Hassan, Diman, Aickelin, Uwe and Wagner, Christian (2014) Comparison of distance metrics for hierarchical data in medical databases. In: Proceedings of the 2014 World Congress on Computational Intelligence (WCCI 2014), 6-11 July 2014, Beijing, China. Biomedical Informatics; Data Mining; Machine Learning http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6889554
title	Comparison of distance metrics for hierarchical data in medical databases
topic	Biomedical Informatics; Data Mining; Machine Learning
url	https://eprints.nottingham.ac.uk/3349/

Similar Items