TY - GEN
T1 - XML structural similarity search using MapReduce
AU - Yuan, Peisen
AU - Sha, Chaofeng
AU - Wang, Xiaoling
AU - Yang, Bin
AU - Zhou, Aoying
AU - Yang, Su
PY - 2010
Y1 - 2010
AB - XML is a de facto standard for web data exchange and information representation. Efficiently managing large volumes of XML data challenges conventional techniques. To cope with large-scale data, the MapReduce computing framework has recently attracted growing attention in the database community as an efficient solution. In this paper, an efficient and scalable framework is proposed for XML structural similarity search on a large cluster with MapReduce. First, sub-structures of the XML structure are extracted in parallel from a large XML corpus stored on a large cluster. Then, Min-Hashing and locality-sensitive hashing techniques are developed on the distributed and parallel computing framework for efficient structural similarity search. An empirical study on the cluster with large real datasets demonstrates the effectiveness and efficiency of the approach.
UR - https://www.scopus.com/pages/publications/77955016500
U2 - 10.1007/978-3-642-14246-8_19
DO - 10.1007/978-3-642-14246-8_19
M3 - Conference contribution
AN - SCOPUS:77955016500
SN - 3642142451
SN - 9783642142451
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 169
EP - 181
BT - Web-Age Information Management - 11th International Conference, WAIM 2010, Proceedings
T2 - 11th International Conference on Web-Age Information Management, WAIM 2010
Y2 - 15 July 2010 through 17 July 2010
ER -