Maximum gradient embeddings and monotone clustering
Let (X,d_X) be an n-point metric space. We show that there exists a
distribution D over non-contractive embeddings into trees f:X-->T such that for
every x in X, the expectation with respect to D of the maximum over y in X of
the ratio d_T(f(x),f(y)) / d_X(x,y) is at most C (log n)^2, where C is a
universal constant. Conversely we show that the above quadratic dependence on
log n cannot be improved in general. Such embeddings, which we call maximum
gradient embeddings, yield a framework for the design of approximation
algorithms for a wide range of clustering problems with monotone costs,
including fault-tolerant versions of k-median and facility location.
Comment: 25 pages, 2 figures. Final version, minor revision of the previous one. To appear in "Combinatorica".
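In symbols, the guarantee stated above can be written as follows (notation from the abstract; the maximum is taken over y distinct from x to avoid the degenerate ratio, and C is a universal constant):

\[
\mathbb{E}_{f \sim D}\Bigl[\,\max_{y \in X \setminus \{x\}} \frac{d_T(f(x), f(y))}{d_X(x, y)}\Bigr] \;\le\; C\,(\log n)^2 \qquad \text{for every } x \in X,
\]

where each f : X --> T drawn from D is non-contractive, i.e. d_T(f(x), f(y)) >= d_X(x, y) for all x, y in X.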
Sliced Wasserstein Distance for Learning Gaussian Mixture Models
Gaussian mixture models (GMM) are powerful parametric tools with many
applications in machine learning and computer vision. Expectation maximization
(EM) is the most popular algorithm for estimating the GMM parameters. However,
EM guarantees only convergence to a stationary point of the log-likelihood
function, which could be arbitrarily worse than the optimal solution. Inspired
by the relationship between the negative log-likelihood function and the
Kullback-Leibler (KL) divergence, we propose an alternative formulation for
estimating the GMM parameters using the sliced Wasserstein distance, which
gives rise to a new algorithm. Specifically, we propose minimizing the
sliced-Wasserstein distance between the mixture model and the data distribution
with respect to the GMM parameters. In contrast to the KL-divergence, the
energy landscape of the sliced Wasserstein distance is better behaved and
therefore more suitable for a stochastic gradient descent scheme to obtain the
optimal GMM parameters. We show that our formulation results in parameter
estimates that are more robust to random initializations and demonstrate that
it can estimate high-dimensional data distributions more faithfully than the EM
algorithm.
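A minimal sketch of this idea, assuming a diagonal-covariance GMM with fixed uniform mixture weights and using PyTorch autograd for the stochastic gradient steps (the dimensions, hyperparameters, and restrictions are illustrative assumptions, not the authors' exact algorithm):

import torch

def sliced_wasserstein(x, y, n_proj=64):
    # Project both samples onto random unit directions; in one dimension the
    # squared W2 distance between equal-size empirical measures is the mean
    # squared difference of the sorted projections.
    theta = torch.randn(n_proj, x.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)
    xs, _ = torch.sort(x @ theta.T, dim=0)
    ys, _ = torch.sort(y @ theta.T, dim=0)
    return ((xs - ys) ** 2).mean()

K, d, n = 3, 2, 512                        # components, dimension, batch size
means = torch.randn(K, d, requires_grad=True)
log_scales = torch.zeros(K, d, requires_grad=True)

def sample_gmm(n):
    # Reparameterised sampling so gradients flow into means and scales;
    # mixture weights are kept uniform in this sketch.
    comps = torch.randint(K, (n,))
    eps = torch.randn(n, d)
    return means[comps] + eps * torch.exp(log_scales[comps])

data = torch.randn(n, d) * 0.5 + torch.tensor([2.0, -1.0])   # toy target data

opt = torch.optim.Adam([means, log_scales], lr=0.05)
for step in range(500):
    loss = sliced_wasserstein(sample_gmm(n), data)
    opt.zero_grad()
    loss.backward()
    opt.step()

Because each projection reduces the problem to sorting one-dimensional samples, every gradient step is cheap, which is what makes the sliced variant practical compared with the full Wasserstein distance.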
A Submodular Optimization Framework for Imbalanced Text Classification with Data Augmentation
In the domain of text classification, imbalanced datasets are a common occurrence. The skewed label distribution of these datasets poses a great challenge to the performance of text classifiers. One popular way to mitigate this challenge is to augment underrepresented labels with synthesized items. The synthesized items are generated by data augmentation methods that can typically produce an unbounded number of items. To select the synthesized items that maximize the performance of text classifiers, we introduce a novel method that selects items which jointly maximize the likelihood of the items belonging to their respective labels and the diversity of the selected set. Our proposed method formulates this joint maximization as a monotone submodular objective function, whose solution can be approximated by a tractable and efficient greedy algorithm. We evaluated our method on multiple real-world datasets with different data augmentation techniques and text classifiers, and compared the results with several baselines. The experimental results demonstrate the effectiveness and efficiency of our method.
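A minimal sketch of greedy maximization of a monotone submodular selection objective of the kind the abstract describes, combining per-item label likelihood with a facility-location diversity term; the feature and score inputs, the lam trade-off weight, and the exact objective are illustrative assumptions, not necessarily the paper's formulation:

import numpy as np

def greedy_select(features, likelihoods, k, lam=1.0):
    # features: (n, d) embeddings of candidate synthesized items
    # likelihoods: (n,) classifier confidence that each item fits its label
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T                      # pairwise cosine similarity
    n = len(likelihoods)
    selected = []
    covered = np.zeros(n)                      # best similarity to the selected set
    for _ in range(k):
        best_gain, best_i = -np.inf, -1
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain = item likelihood + increase in diversity coverage.
            gain = likelihoods[i] + lam * (np.maximum(covered, sim[i]) - covered).sum()
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
        covered = np.maximum(covered, sim[best_i])
    return selected

# Toy usage: pick 10 of 100 synthesized items.
rng = np.random.default_rng(0)
print(greedy_select(rng.normal(size=(100, 16)), rng.uniform(size=100), k=10))

Because the objective is monotone and submodular, the standard greedy argument gives a (1 - 1/e) approximation guarantee for this kind of selection, which is what makes the simple loop above a reasonable surrogate for the intractable exact maximization.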