    Distributed Hierarchical Co-clustering and Collaborative Filtering Algorithm

    Abstract: Petascale analytics is an active research area in both academia and industry. It envisages processing massive amounts of data at extremely high rates to generate new scientific insights and to benefit both users and providers in industries such as e-commerce, telecom, finance, and life sciences. We consider collaborative filtering (CF) and clustering algorithms, which are fundamental analytics kernels that help achieve these aims. Performing real-time CF and co-clustering on massive, highly sparse datasets while achieving high prediction accuracy is a computationally challenging problem. In this paper, we present a novel hierarchical design for a soft real-time (less than 1 minute) distributed co-clustering-based collaborative filtering algorithm. Our distributed algorithm has been optimized for multi-core cluster architectures. A theoretical analysis of the time complexity of our algorithm proves the efficacy of our approach. Using the Netflix dataset (900M training ratings with replication) and the Yahoo KDD Cup 1 dataset (4.6B training ratings with replication), we demonstrate the performance and scalability of our algorithm on a 4096-node multi-core cluster architecture. Our distributed algorithm (implemented using OpenMP with MPI) demonstrates around 4× better performance (on Blue Gene/P) than the best prior work, along with high accuracy (RMSE of 26 ± 4 for the Yahoo KDD Cup data and 0.87 ± 0.02 for the Netflix data). To the best of our knowledge, these are the best known performance results for collaborative filtering, at high prediction accuracy, on multi-core cluster architectures.