1 research outputs found
Large Scale Distributed Distance Metric Learning
In large scale machine learning and data mining problems with high feature
dimensionality, the Euclidean distance between data points can be
uninformative, and Distance Metric Learning (DML) is often desired to learn a
proper similarity measure (using side information such as example data pairs
being similar or dissimilar). However, high dimensionality and large volume of
pairwise constraints in modern big data can lead to prohibitive computational
cost for both the original DML formulation in Xing et al. (2002) and later
extensions. In this paper, we present a distributed algorithm for DML, and a
large-scale implementation on a parameter server architecture. Our approach
builds on a parallelizable reformulation of Xing et al. (2002), and an
asynchronous stochastic gradient descent optimization procedure. To our
knowledge, this is the first distributed solution to DML, and we show that, on
a system with 256 CPU cores, our program is able to complete a DML task on a
dataset with 1 million data points, 22-thousand features, and 200 million
labeled data pairs, in 15 hours; and the learned metric shows great
effectiveness in properly measuring distances