XML document clustering using structure-preserving flat representation of XML content and structure

With the increasing use of XML in many domains, XML document clustering has been a central research topic in semistructured data management and mining. Due to the semistructured nature of XML data, the clustering problem becomes particularly challenging, mainly because structural similarity measures...

Bibliographic Details
Main Authors: Hadzic, Fedja, Hecker, Michael, Tagerelli, A.
Other Authors: Deyi Li
Format: Conference Paper
Published: Springer 2011
Online Access: http://hdl.handle.net/20.500.11937/4997
_version_ 1848744671849742336
author Hadzic, Fedja
Hecker, Michael
Tagerelli, A.
author2 Deyi Li
author_facet Deyi Li
Hadzic, Fedja
Hecker, Michael
Tagerelli, A.
author_sort Hadzic, Fedja
building Curtin Institutional Repository
collection Online Access
description With the increasing use of XML in many domains, XML document clustering has been a central research topic in semistructured data management and mining. Due to the semistructured nature of XML data, the clustering problem becomes particularly challenging, mainly because structural similarity measures specifically designed to deal with tree/graph-shaped data can be quite expensive. Specialized clustering techniques are being developed to account for this difficulty; however, most of them still assume that XML documents are represented using a semistructured data model. In this paper we take a simpler approach whereby XML structural aspects are extracted from the documents to generate a flat data format to which well-established clustering methods can be directly applied. Hence, the expensive process of tree/graph data mining is avoided, while the structural properties are still preserved. Our experimental evaluation, using a number of real-world datasets and comparing against existing structural clustering methods, has demonstrated the significance of our approach.
first_indexed 2025-11-14T06:05:11Z
format Conference Paper
id curtin-20.500.11937-4997
institution Curtin University Malaysia
institution_category Local University
last_indexed 2025-11-14T06:05:11Z
publishDate 2011
publisher Springer
recordtype eprints
repository_type Digital Repository
spelling curtin-20.500.11937-4997 2023-01-18T08:46:46Z XML document clustering using structure-preserving flat representation of XML content and structure Hadzic, Fedja Hecker, Michael Tagerelli, A. Deyi Li Bing Liu Charu C Aggarwal With the increasing use of XML in many domains, XML document clustering has been a central research topic in semistructured data management and mining. Due to the semistructured nature of XML data, the clustering problem becomes particularly challenging, mainly because structural similarity measures specifically designed to deal with tree/graph-shaped data can be quite expensive. Specialized clustering techniques are being developed to account for this difficulty; however, most of them still assume that XML documents are represented using a semistructured data model. In this paper we take a simpler approach whereby XML structural aspects are extracted from the documents to generate a flat data format to which well-established clustering methods can be directly applied. Hence, the expensive process of tree/graph data mining is avoided, while the structural properties are still preserved. Our experimental evaluation, using a number of real-world datasets and comparing against existing structural clustering methods, has demonstrated the significance of our approach. 2011 Conference Paper http://hdl.handle.net/20.500.11937/4997 Springer restricted
spellingShingle Hadzic, Fedja
Hecker, Michael
Tagerelli, A.
XML document clustering using structure-preserving flat representation of XML content and structure
title XML document clustering using structure-preserving flat representation of XML content and structure
title_full XML document clustering using structure-preserving flat representation of XML content and structure
title_fullStr XML document clustering using structure-preserving flat representation of XML content and structure
title_full_unstemmed XML document clustering using structure-preserving flat representation of XML content and structure
title_short XML document clustering using structure-preserving flat representation of XML content and structure
title_sort xml document clustering using structure-preserving flat representation of xml content and structure
url http://hdl.handle.net/20.500.11937/4997