Given a collection of objects, the All Pairs Similarity Search problem involves discovering all those pairs of objects whose similarity is above a certain threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. The proposed algorithm is based on the inverted index approach and incorporates state-of-the-art pruning techniques. This is the first work that explores the feasibility of index pruning in a MapReduce algorithm. We evaluate several heuristics aimed at reducing the communication costs and the load imbalance. The resulting algorithm gives exact results up to 5x faster than the current best-known solution that employs MapReduce.
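
To make the problem statement concrete, the following is a minimal sequential sketch of all-pairs similarity search over an inverted index. It is not the algorithm proposed in the paper: it runs on a single machine, applies no pruning, and assumes cosine similarity over sparse term-weight vectors. The names `normalize` and `all_pairs` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def normalize(doc):
    """Scale a sparse term -> weight vector to unit length."""
    norm = sqrt(sum(w * w for w in doc.values()))
    return {t: w / norm for t, w in doc.items()} if norm else {}


def all_pairs(docs, threshold):
    """Return {(i, j): cosine} for all i < j with cosine >= threshold.

    Assumption: each document is a dict mapping a term to a numeric weight.
    """
    vectors = [normalize(d) for d in docs]

    # Inverted index: term -> list of (doc_id, weight), in doc_id order.
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))

    # Every posting list contributes one partial product to each pair of
    # documents that share the term; summing them yields the dot product.
    scores = defaultdict(float)
    for postings in index.values():
        for (i, wi), (j, wj) in combinations(postings, 2):
            scores[(i, j)] += wi * wj

    # Keep only the pairs whose similarity reaches the threshold.
    return {pair: s for pair, s in scores.items() if s >= threshold}
```

Because only documents that share at least one term ever meet in a posting list, sparse collections generate far fewer candidate pairs than a brute-force comparison of every pair would. Pruning techniques of the kind the abstract mentions aim to shrink the index and the candidate set further.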