Scaling Out All Pairs Similarity Search with MapReduce (Regular Paper)

By Gianmarco De Francisci, Claudio Lucchese and Ranieri Baraglia

Abstract

Given a collection of objects, the All Pairs Similarity Search problem involves discovering all those pairs of objects whose similarity is above a certain threshold. In this paper we focus on document collections which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. The proposed algorithm is based on the inverted index approach and incorporates state-of-the-art pruning techniques. This is the first work that explores the feasibility of index pruning in a MapReduce algorithm. We evaluate several heuristics aimed at reducing the communication costs and the load imbalance. The resulting algorithm gives exact results up to 5x faster than the current best known solution that employs MapReduce.
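
The inverted index approach mentioned in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' algorithm: it applies no pruning and does not run on MapReduce, and the similarity measure (cosine over sparse term-weight vectors) and the names `normalize` and `all_pairs` are assumptions made only for this illustration. It shows why sparseness helps: only documents that share at least one term are ever compared.

```python
from collections import defaultdict
from math import sqrt


def normalize(doc):
    """Scale a sparse term -> weight vector to unit length."""
    norm = sqrt(sum(w * w for w in doc.values()))
    return {t: w / norm for t, w in doc.items()} if norm else {}


def all_pairs(docs, threshold):
    """Return {(i, j): similarity} for i < j with cosine similarity >= threshold.

    Assumption: cosine similarity over unit-normalized sparse vectors.
    No index pruning is performed in this sketch.
    """
    vectors = [normalize(d) for d in docs]

    # Indexing phase (map-like): term -> list of (doc_id, weight).
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))

    # Similarity phase (reduce-like): every posting list contributes a
    # partial dot product to each pair of documents sharing that term.
    scores = defaultdict(float)
    for postings in index.values():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                (i, wi), (j, wj) = postings[a], postings[b]
                scores[(i, j)] += wi * wj

    return {pair: s for pair, s in scores.items() if s >= threshold}
```

For example, calling `all_pairs([{"a": 1, "b": 1}, {"a": 1, "b": 1}, {"c": 1}], 0.9)` returns a single pair, `(0, 1)`, because the third document shares no term with the others and is never scored against them.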

Year: 2015
OAI identifier: oai:CiteSeerX.psu:10.1.1.667.3472
Provided by: CiteSeerX
Download PDF:
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://ceur-ws.org/Vol-630/lsd... (external link)
  • http://citeseerx.ist.psu.edu/v... (external link)