9 research outputs found
Robust Correlation Clustering
In this paper, we introduce and study the Robust-Correlation-Clustering problem: given a graph G = (V,E) where every edge is either labeled + or - (denoting similar or dissimilar pairs of vertices), and a parameter m, the goal is to delete a set D of m vertices, and partition the remaining vertices V \ D into clusters to minimize the cost of the clustering, which is the sum of the number of + edges with end-points in different clusters and the number of - edges with end-points in the same cluster. This generalizes the classical Correlation-Clustering problem which is the special case when m = 0. Correlation clustering is useful when we have (only) qualitative information about the similarity or dissimilarity of pairs of points, and Robust-Correlation-Clustering equips this model with the capability to handle noise in datasets.
In this work, we present a constant-factor bi-criteria algorithm for Robust-Correlation-Clustering on complete graphs (where our solution is O(1)-approximate w.r.t. the cost while discarding O(m) points as outliers), and also complement this by showing that no finite approximation is possible if we do not violate the outlier budget. Our algorithm is very simple in that it first does a simple LP-based pre-processing to delete O(m) vertices, and subsequently runs a particular Correlation-Clustering algorithm ACNAlg [Ailon et al., 2005] on the residual instance. We then consider general graphs, and show (O(log n), O(log^2 n)) bi-criteria algorithms while also showing a hardness of alpha_MC on both the cost and the outlier violation, where alpha_MC is the lower bound for the Minimum-Multicut problem.
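The clustering objective above is easy to make concrete. The following is a minimal illustrative sketch (names and the toy instance are mine, not the paper's notation): it counts + edges cut across clusters and - edges kept inside a cluster, with deleted outliers simply omitted from the cluster assignment.

```python
def clustering_cost(edges, cluster):
    """Cost = (+)-edges across clusters + (-)-edges within a cluster.

    edges: list of (u, v, sign) with sign in {'+', '-'}
    cluster: dict mapping each retained vertex to its cluster id;
             vertices absent from `cluster` are treated as deleted outliers.
    """
    cost = 0
    for u, v, sign in edges:
        if u not in cluster or v not in cluster:
            continue  # edge touches a deleted vertex; it contributes nothing
        same = cluster[u] == cluster[v]
        if (sign == '+' and not same) or (sign == '-' and same):
            cost += 1
    return cost

# Toy instance: a '+' path 0-1-2 plus a '-' edge between 0 and 2.
edges = [(0, 1, '+'), (1, 2, '+'), (0, 2, '-')]
print(clustering_cost(edges, {0: 0, 1: 0, 2: 0}))  # one big cluster -> cost 1
print(clustering_cost(edges, {0: 0, 1: 0}))        # delete vertex 2 -> cost 0
```

The second call illustrates why the outlier budget matters: deleting a single vertex removes the unavoidable disagreement entirely.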
Missing Mass of Rank-2 Markov Chains
Estimation of missing mass with the popular Good-Turing (GT) estimator is
well-understood in the case where samples are independent and identically
distributed (iid). In this article, we consider the same problem when the
samples come from a stationary Markov chain with a rank-2 transition matrix,
which is one of the simplest extensions of the iid case. We develop an upper
bound on the absolute bias of the GT estimator in terms of the spectral gap of
the chain and a tail bound on the occupancy of states. Borrowing tail bounds
from known concentration results for Markov chains, we evaluate the bound using
other parameters of the chain. The analysis, supported by simulations, suggests
that, for rank-2 irreducible chains, the GT estimator has bias and mean-squared
error falling with the number of samples at a rate that depends loosely on the
connectivity of the states in the chain.
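The Good-Turing estimator discussed above has a simple closed form: the estimated missing mass is the fraction of samples whose symbol was seen exactly once. A minimal sketch (helper name is mine; the formula itself is the standard GT estimator, which applies verbatim whether the samples are iid or Markov, though the bias analysis differs):

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """Good-Turing estimate of missing mass: (# symbols seen exactly once) / n."""
    n = len(samples)
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n

print(good_turing_missing_mass(['a', 'b', 'a', 'c']))  # 2 singletons / 4 = 0.5
```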
Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing
Pruning schemes have been widely used in practice to reduce the complexity of
trained models with a massive number of parameters. In fact, several practical
studies have shown that if a pruned model is fine-tuned with some
gradient-based updates it generalizes well to new samples. Although the above
pipeline, which we refer to as pruning + fine-tuning, has been extremely
successful in lowering the complexity of trained models, there is very little
known about the theory behind this success. In this paper, we address this
issue by investigating the pruning + fine-tuning framework on the
overparameterized matrix sensing problem, where a low-rank ground truth is
fitted with a factor matrix that has more columns than the rank of the ground
truth. We study the approximate local minima of the mean square error,
augmented with a smooth version of a group Lasso regularizer on the columns of
the factor matrix. In particular, we provably show that pruning all the columns
below a certain explicit l2-norm threshold results in a solution which has the
minimum possible number of columns, yet is close to the ground truth in
training loss. Moreover, in the subsequent fine-tuning phase, gradient descent
initialized at the pruned solution converges at a linear rate to its limit.
While our analysis provides insights into the
role of regularization in pruning, we also show that running gradient descent
in the absence of regularization results in models that are not suitable for
greedy pruning, i.e., many columns could have their l2 norm comparable
to that of the maximum. To the best of our knowledge, our results provide the
first rigorous insights on why greedy pruning + fine-tuning leads to smaller
models which also generalize well.
Comment: 49 pages, 2 figures
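The greedy-pruning step discussed above (drop every column of the factor matrix whose l2 norm falls below a threshold) can be sketched in a few lines. The matrix and threshold below are illustrative stand-ins, not the paper's quantities:

```python
import numpy as np

def greedy_prune(U, threshold):
    """Keep only the columns of U whose l2 norm is >= threshold."""
    norms = np.linalg.norm(U, axis=0)  # per-column l2 norms
    return U[:, norms >= threshold]

# Two large columns would survive; here one column is near zero and is dropped.
U = np.array([[1.0, 1e-4],
              [0.0, 1e-4],
              [1.0, 0.0],
              [0.0, 0.0]])
print(greedy_prune(U, 0.1).shape)  # -> (4, 1): the tiny column is pruned
```

The paper's point is that the group Lasso term drives redundant columns toward zero norm, so this threshold rule recovers the right number of columns; without regularization, column norms stay comparable and no such threshold exists.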
Statistical Complexity and Optimal Algorithms for Non-linear Ridge Bandits
We consider the sequential decision-making problem where the mean outcome is
a non-linear function of the chosen action. Compared with the linear model, two
curious phenomena arise in non-linear models: first, in addition to the
"learning phase" with a standard parametric rate for estimation or regret,
there is a "burn-in period" with a fixed cost determined by the non-linear
function; second, achieving the smallest burn-in cost requires new exploration
algorithms. For a special family of non-linear functions named ridge functions
in the literature, we derive upper and lower bounds on the optimal burn-in
cost, and in addition, on the entire learning trajectory during the burn-in
period via differential equations. In particular, a two-stage algorithm that
first finds a good initial action and then treats the problem as locally linear
is statistically optimal. In contrast, several classical algorithms, such as
UCB and algorithms relying on regression oracles, are provably suboptimal.
Comment: Title change; add a new lower bound for linear bandits in Theorem 1.
Minimax Optimal Online Imitation Learning via Replay Estimation
Online imitation learning is the problem of how best to mimic expert
demonstrations, given access to the environment or an accurate simulator. Prior
work has shown that in the infinite sample regime, exact moment matching
achieves value equivalence to the expert policy. However, in the finite sample
regime, even if one has no optimization error, empirical variance can lead to a
performance gap that grows with the horizon and shrinks with the size of the
expert dataset, at different rates for behavioral cloning and for online
moment matching. We introduce the technique of replay estimation to
reduce this empirical variance: by repeatedly executing cached expert actions
in a stochastic simulator, we compute a smoother expert visitation distribution
estimate to match. In the presence of general function approximation, we prove
a meta theorem reducing the performance gap of our approach to the parameter
estimation error for offline classification (i.e. learning the expert policy).
In the tabular setting or with linear function approximation, our meta theorem
shows that the performance gap incurred by our approach achieves the optimal
dependence on the horizon and dataset size, under significantly weaker
assumptions compared to prior work. We
implement multiple instantiations of our approach on several continuous control
tasks and find that we are able to significantly improve policy performance
across a variety of dataset sizes.
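The replay-estimation idea above can be sketched in a tabular toy setting: re-execute each cached expert action many times in a stochastic simulator and average the outcomes into a smoothed visitation estimate. Everything below (the simulator, the dataset, the function names) is an illustrative stand-in, not the paper's environments or implementation:

```python
from collections import Counter
import random

def replay_estimate(expert_dataset, simulator, replays=100, seed=0):
    """Smoothed next-state distribution from replaying cached expert actions."""
    rng = random.Random(seed)
    counts = Counter()
    total = 0
    for state, action in expert_dataset:
        for _ in range(replays):  # replay the same cached action many times
            counts[simulator(state, action, rng)] += 1
            total += 1
    return {s: c / total for s, c in counts.items()}

# Toy two-state simulator: 'stay' keeps the current state 80% of the time.
def simulator(state, action, rng):
    return state if rng.random() < 0.8 else 1 - state

est = replay_estimate([(0, 'stay')], simulator, replays=1000)
print(est)  # roughly {0: 0.8, 1: 0.2}
```

A single expert transition gives a one-sample (hence high-variance) estimate of where the expert's action leads; replaying it in the simulator replaces that sample with an average, which is the variance reduction the abstract describes.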