How improve Set Similarity Join based on prefix approach in distributed environment

Authors: Domenico Beneventano, Giovanni Simonini, Luca Gagliardelli, Song Zhu
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018

Set similarity join is an essential operation in data integration and big data analytics: it finds similar pairs of records where the records contain string or set-based data. To cope with the increasing scale of data, several techniques have been proposed to perform set similarity joins using distributed frameworks, such as MapReduce. In particular, Vernica et al. [3] proposed a MapReduce implementation of the PPJoin algorithm [2], which a recent experimental study showed to be one of the best set similarity join algorithms [4]. These techniques, however, usually produce huge numbers of duplicates in order to perform parallel processing successfully. The large number of duplicates incurs both a large shuffle cost and unnecessary computation cost, which significantly decreases performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects their scalability. To address these problems, in this paper we propose a duplicate-free framework, called TTJoin, that performs set similarity joins efficiently by utilizing an innovative filter based on prefix tokens, and we implement it on one of the most popular distributed frameworks, Apache Spark. Experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to both the traditional PPJoin and the MapReduce implementation proposed in [3].
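
The abstract does not describe TTJoin's filter, so the following is only a minimal single-machine sketch of the classical prefix-filtering principle that prefix-based joins such as PPJoin build on. It assumes Jaccard similarity, and the function names, the rarest-first token ordering, and the sample records are hypothetical choices made for illustration. Each record keeps only a short prefix of its ordered tokens in an inverted index, and only records sharing a prefix token are compared. A pair can share several prefix tokens and so be generated more than once, which is the kind of duplication the abstract refers to; here a set removes it.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, threshold):
    """Return sorted id pairs whose Jaccard similarity is >= threshold.

    Illustrative sketch of classical prefix filtering, not TTJoin.
    `records` maps a record id to a list of tokens.
    """
    # Assumed global token order: rarest tokens first, ties broken by token.
    freq = Counter(tok for rec in records.values() for tok in set(rec))

    # Index each record under the tokens of its prefix only.
    index = defaultdict(set)
    for rid, rec in records.items():
        tokens = sorted(set(rec), key=lambda t: (freq[t], t))
        # Two records with Jaccard >= threshold must share a token
        # within their first |x| - ceil(threshold * |x|) + 1 tokens.
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens)) + 1
        for tok in tokens[:prefix_len]:
            index[tok].add(rid)

    # Candidate pairs share at least one prefix token; the set removes
    # pairs generated more than once under different tokens.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(sorted(rids), 2))

    # Verify each surviving candidate with the exact similarity.
    return sorted(
        (a, b) for a, b in candidates
        if jaccard(records[a], records[b]) >= threshold
    )


# Hypothetical sample data.
sample = {
    "r1": ["a", "b", "c", "d"],
    "r2": ["a", "b", "c", "e"],
    "r3": ["x", "y", "z"],
}
result = prefix_filter_join(sample, 0.5)
```

In a distributed setting, each prefix token would typically act as a grouping key, so a pair sharing several prefix tokens reaches several workers; that is where the shuffle and duplicate-verification costs mentioned in the abstract come from.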

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia