Matrix factorization is a widely used technique in machine learning, most notably in recommender systems. Despite its high prediction accuracy and its robustness to over-fitting, the Bayesian Probabilistic Matrix Factorization (BPMF) algorithm has rarely been applied to large-scale data because of its prohibitive computational cost. In this paper, we propose a high-performance parallel implementation of BPMF using Gibbs sampling for both shared-memory and distributed-memory architectures. We show that efficient load balancing through work stealing on a single node, combined with asynchronous communication in the distributed version, allows us to outperform state-of-the-art implementations.
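To make the sampling kernel concrete, the following is a minimal sketch of the per-user Gibbs update at the heart of BPMF: a user's latent vector is resampled from its Gaussian conditional posterior given the current item factors and the prior hyperparameters. This is an illustration under stated assumptions, not the paper's implementation; the names (`sample_user`, `Rating`), the Eigen-based linear algebra, and the hyperparameter interface (`mu_u`, `Lambda_u`, noise precision `alpha`) are all ours.

```cpp
#include <Eigen/Dense>
#include <random>
#include <vector>

using Eigen::MatrixXd;
using Eigen::VectorXd;

// One observed rating of item `item` by the user being resampled.
struct Rating { int item; double value; };

// Draws a d-dimensional standard normal vector.
static VectorXd std_normal(int d, std::mt19937 &gen) {
    std::normal_distribution<double> n01(0.0, 1.0);
    VectorXd z(d);
    for (int k = 0; k < d; ++k) z(k) = n01(gen);
    return z;
}

// Gibbs update for one user's latent vector: its conditional posterior,
// given the item factors V (d x num_items), the user's ratings, the
// prior mean mu_u and precision Lambda_u, and the noise precision alpha,
// is Gaussian with precision Lambda_u + alpha * sum_j v_j v_j^T.
VectorXd sample_user(const std::vector<Rating> &ratings,
                     const MatrixXd &V,
                     const VectorXd &mu_u,
                     const MatrixXd &Lambda_u,
                     double alpha,
                     std::mt19937 &gen) {
    MatrixXd prec = Lambda_u;               // posterior precision matrix
    VectorXd lin  = Lambda_u * mu_u;        // precision-weighted mean
    for (const Rating &r : ratings) {
        const VectorXd vj = V.col(r.item);
        prec.noalias() += alpha * vj * vj.transpose();
        lin += alpha * r.value * vj;
    }
    // Posterior is N(prec^-1 * lin, prec^-1); sample via Cholesky:
    // if prec = L * L^T, then mu + L^-T z (z standard normal) has the
    // required mean and covariance.
    Eigen::LLT<MatrixXd> llt(prec);
    VectorXd mu = llt.solve(lin);
    return mu + llt.matrixU().solve(std_normal(V.rows(), gen));
}
```

Because each user's update depends only on the item factors (and vice versa), all users in one half-iteration can be sampled independently, which is what makes per-node work stealing and cross-node factor exchange effective.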
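On the distributed side, the claimed speedup rests on overlapping communication with computation. Below is a hedged sketch of what non-blocking factor exchange can look like with MPI; the function names, message layout, and tagging scheme are assumptions for illustration, not the paper's actual protocol.

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Sketch: after a rank finishes resampling a block of latent factors,
// it pushes the fresh values to every other rank with non-blocking
// sends, so sampling of the next block overlaps the communication.
// The caller must keep `block` alive until the requests complete.
void push_updates_async(const std::vector<double> &block, int tag,
                        int my_rank, int num_ranks,
                        std::vector<MPI_Request> &pending) {
    for (int peer = 0; peer < num_ranks; ++peer) {
        if (peer == my_rank) continue;
        MPI_Request req;
        MPI_Isend(block.data(), static_cast<int>(block.size()), MPI_DOUBLE,
                  peer, tag, MPI_COMM_WORLD, &req);
        pending.push_back(req);
    }
}

// Called between sampling steps: reaps completed sends without blocking,
// keeping the list of in-flight requests small.
void reap_completed(std::vector<MPI_Request> &pending) {
    for (std::size_t i = 0; i < pending.size();) {
        int done = 0;
        MPI_Test(&pending[i], &done, MPI_STATUS_IGNORE);
        if (done) {
            pending[i] = pending.back();
            pending.pop_back();
        } else {
            ++i;
        }
    }
}
```

Keeping the sends non-blocking means a slow rank never stalls its peers mid-iteration, which is the point of the asynchronous communication the abstract emphasizes.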