Efficient and Scalable Graph Similarity Joins in MapReduce

Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constrai...

Full description

Bibliographic Details
Main Authors:	Chen, Yifan, Zhao, Xiang, Xiao, Chuan, Zhang, Weiming, Tang, Jiuyang
Format:	Online
Language:	English
Published:	Hindawi Publishing Corporation 2014
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4121100/

id	pubmed-4121100
recordtype	oai_dc
spelling	pubmed-41211002014-08-12 Efficient and Scalable Graph Similarity Joins in MapReduce Chen, Yifan Zhao, Xiang Xiao, Chuan Zhang, Weiming Tang, Jiuyang Research Article Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results. Hindawi Publishing Corporation 2014 2014-07-08 /pmc/articles/PMC4121100/ /pubmed/25121135 http://dx.doi.org/10.1155/2014/749028 Text en Copyright © 2014 Yifan Chen et al. https://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
repository_type	Open Access Journal
institution_category	Foreign Institution
institution	US National Center for Biotechnology Information
building	NCBI PubMed
collection	Online Access
language	English
format	Online
author	Chen, Yifan Zhao, Xiang Xiao, Chuan Zhang, Weiming Tang, Jiuyang
spellingShingle	Chen, Yifan Zhao, Xiang Xiao, Chuan Zhang, Weiming Tang, Jiuyang Efficient and Scalable Graph Similarity Joins in MapReduce
author_facet	Chen, Yifan Zhao, Xiang Xiao, Chuan Zhang, Weiming Tang, Jiuyang
author_sort	Chen, Yifan
title	Efficient and Scalable Graph Similarity Joins in MapReduce
title_short	Efficient and Scalable Graph Similarity Joins in MapReduce
title_full	Efficient and Scalable Graph Similarity Joins in MapReduce
title_fullStr	Efficient and Scalable Graph Similarity Joins in MapReduce
title_full_unstemmed	Efficient and Scalable Graph Similarity Joins in MapReduce
title_sort	efficient and scalable graph similarity joins in mapreduce
description	Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm following the filtering-verification framework for efficient graph similarity joins. It relies on counting overlapping graph signatures for filtering out nonpromising candidates. With the potential issue of too many key-value pairs in the filtering phase, spectral Bloom filters are introduced to reduce the number of key-value pairs. Furthermore, we integrate the multiway join strategy to boost the verification, where a MapReduce-based method is proposed for GED calculation. The superior efficiency and scalability of the proposed algorithms are demonstrated by extensive experimental results.
publisher	Hindawi Publishing Corporation
publishDate	2014
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4121100/
_version_	1613120680524513280

Efficient and Scalable Graph Similarity Joins in MapReduce

Similar Items