Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning
The development of cluster computing frameworks has allowed practitioners to
scale out various statistical estimation and machine learning algorithms with
minimal programming effort. This is especially true for machine learning
problems whose objective function is nicely separable across individual data
points, such as classification and regression. In contrast, statistical
learning tasks involving pairs (or more generally tuples) of data points, such
as metric learning, clustering or ranking, do not lend themselves as easily to
data-parallelism and in-memory computing. In this paper, we investigate how to
balance between statistical performance and computational efficiency in such
distributed tuplewise statistical problems. We first propose a simple strategy
based on occasionally repartitioning data across workers between parallel
computation stages, where the number of repartitioning steps rules the
trade-off between accuracy and runtime. We then present some theoretical
results highlighting the benefits brought by the proposed method in terms of
variance reduction, and extend our results to design distributed stochastic
gradient descent algorithms for tuplewise empirical risk minimization. Our
results are supported by numerical experiments in pairwise statistical
estimation and learning on synthetic and real-world datasets.
Comment: 23 pages, 6 figures, ECML 201
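
The repartitioning strategy described in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function names (`pairwise_mean`, `random_partition`, `repartitioned_estimate`, `half_squared_diff`) are hypothetical, the "workers" are plain Python lists rather than cluster nodes, and the final estimate is assumed to be a simple average over workers and over repartitioning rounds. Each worker only evaluates pairs inside its own block, so pairs that straddle two blocks are never seen within a round; reshuffling the data between rounds exposes new pairs, which is the intuition behind the variance reduction the abstract mentions.

```python
import random
from itertools import combinations


def half_squared_diff(a, b):
    """Example pairwise kernel: its average over all pairs is the sample variance."""
    return 0.5 * (a - b) ** 2


def pairwise_mean(points, kernel):
    """Average of the kernel over all distinct pairs (a degree-2 U-statistic)."""
    pairs = list(combinations(points, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def random_partition(points, n_workers, rng):
    """Shuffle a copy of the data and deal it into n_workers near-equal blocks."""
    shuffled = list(points)
    rng.shuffle(shuffled)
    return [shuffled[i::n_workers] for i in range(n_workers)]


def repartitioned_estimate(points, kernel, n_workers, n_rounds, seed=0):
    """Average of per-worker pairwise estimates over n_rounds random partitions.

    Assumption: workers are simulated sequentially in one process, and each
    round draws a fresh uniform random partition of the full dataset.
    """
    if 2 * n_workers > len(points):
        raise ValueError("each worker needs at least two points to form a pair")
    rng = random.Random(seed)
    round_estimates = []
    for _ in range(n_rounds):
        blocks = random_partition(points, n_workers, rng)
        # Each worker sees only the pairs inside its own block.
        local = [pairwise_mean(block, kernel) for block in blocks]
        round_estimates.append(sum(local) / len(local))
    return sum(round_estimates) / len(round_estimates)
```

With `n_workers=1` the function reduces to the complete pairwise estimate on the full dataset; with more workers, increasing `n_rounds` trades extra shuffling cost for an estimate that draws on more distinct pairs, mirroring the accuracy versus runtime trade-off stated above.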
Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning
The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is nicely separable across individual data points, such as classification and regression. In contrast, statistical learning tasks involving pairs (or more generally tuples) of data points, such as metric learning, clustering or ranking, do not lend themselves as easily to data-parallelism and in-memory computing. In this paper, we investigate how to balance between statistical performance and computational efficiency in such distributed tuplewise statistical problems. We first propose a simple strategy based on occasionally repartitioning data across workers between parallel computation stages, where the number of repartitioning steps rules the trade-off between accuracy and runtime. We then present some theoretical results highlighting the benefits brought by the proposed method in terms of variance reduction, and extend our results to design distributed stochastic gradient descent algorithms for tuplewise empirical risk minimization. Our results are supported by numerical experiments in pairwise statistical estimation and learning on synthetic and real-world datasets.