Correlation Clustering with Noisy Partial Information
In this paper, we propose and study a semi-random model for the Correlation
Clustering problem on arbitrary graphs G. We give two approximation algorithms
for Correlation Clustering instances from this model. The first algorithm
finds, with high probability, a solution whose value is close to the value of
the optimal solution. The second algorithm finds the ground truth clustering
with an arbitrarily small classification error (under some additional
assumptions on the instance).
Comment: To appear at Conference on Learning Theory (COLT) 2015. Substantial
changes from previous version, including a new section on recovery of the
ground truth clustering. 20 pages
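For readers new to the problem, the Correlation Clustering objective that these algorithms approximate can be computed directly; the following sketch and its tiny instance are illustrative only, not taken from the paper.

```python
# Hypothetical sketch: the Correlation Clustering objective (number of
# disagreements) for a given clustering of a signed graph.

def disagreements(signed_edges, cluster_of):
    """Count edges that disagree with the clustering.

    signed_edges: list of (u, v, sign) with sign +1 (similar) or -1 (dissimilar).
    cluster_of:   dict mapping each vertex to its cluster id.
    """
    cost = 0
    for u, v, sign in signed_edges:
        same = cluster_of[u] == cluster_of[v]
        # A +1 edge disagrees if its endpoints are split;
        # a -1 edge disagrees if its endpoints are together.
        if (sign == +1 and not same) or (sign == -1 and same):
            cost += 1
    return cost

edges = [(0, 1, +1), (1, 2, +1), (0, 2, -1), (2, 3, -1)]
clustering = {0: "a", 1: "a", 2: "a", 3: "b"}
print(disagreements(edges, clustering))  # only the -1 edge (0, 2) disagrees -> 1
```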
Approximate Correlation Clustering Using Same-Cluster Queries
Ashtiani et al. (NIPS 2016) introduced a semi-supervised framework for
clustering (SSAC) where a learner is allowed to make same-cluster queries. More
specifically, in their model, there is a query oracle that answers queries of
the form "given any two vertices, do they belong to the same optimal cluster?".
Ashtiani et al. showed the usefulness of such a query framework by giving a
polynomial time algorithm for the k-means clustering problem where the input
dataset satisfies some separation condition. Ailon et al. extended the above
work to the approximation setting by giving an efficient (1+ε)-approximation
algorithm for k-means for any small ε > 0 and any dataset within the SSAC
framework. In this work, we extend this line of study to the correlation
clustering problem. Correlation clustering is a graph clustering problem where
pairwise similarity (or dissimilarity) information is given for every pair of
vertices and the objective is to partition the vertices into clusters that
minimise the disagreement (or maximise the agreement) with the pairwise
information given as input. These problems are popularly known as MinDisAgree
and MaxAgree problems, and MinDisAgree[k] and MaxAgree[k] are versions of these
problems where the number of optimal clusters is at most k. There exist
Polynomial Time Approximation Schemes (PTAS) for MinDisAgree[k] and MaxAgree[k]
where the approximation guarantee is (1+ε) for any small ε and the
running time is polynomial in the input parameters but exponential in k and
1/ε. We obtain a (1+ε)-approximation algorithm for any small ε with
running time that is polynomial in the input parameters and also in k and
1/ε. We also give non-trivial upper and lower bounds on the number of
same-cluster queries, the lower bound being based on the Exponential Time
Hypothesis (ETH).
Comment: To appear in LATIN 201
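The same-cluster-query framework can be made concrete with the simple baseline that query bounds of this kind improve on: assign each vertex by querying it against one representative per cluster discovered so far (at most k queries per vertex). The oracle below is a stand-in assuming noiseless answers; names and the toy ground truth are illustrative.

```python
# Minimal sketch of the O(nk) representative-based baseline for
# clustering with a (noiseless) same-cluster oracle.

def cluster_with_queries(vertices, same_cluster):
    """same_cluster(u, v) -> True iff u and v lie in the same optimal cluster."""
    representatives = []          # one representative vertex per cluster found
    assignment = {}
    for v in vertices:
        for i, rep in enumerate(representatives):
            if same_cluster(v, rep):      # one query per existing cluster
                assignment[v] = i
                break
        else:                             # no representative matched: new cluster
            assignment[v] = len(representatives)
            representatives.append(v)
    return assignment

# Toy ground truth: the parity of the vertex id defines two clusters.
oracle = lambda u, v: (u % 2) == (v % 2)
print(cluster_with_queries(range(6), oracle))  # {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}
```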
Predicting Positive and Negative Links with Noisy Queries: Theory & Practice
Social networks involve both positive and negative relationships, which can
be captured in signed graphs. The edge sign prediction problem aims to
predict whether an interaction between a pair of nodes will be positive or
negative. We provide theoretical results for this problem that motivate natural
improvements to recent heuristics.
The edge sign prediction problem is related to correlation clustering; a
positive relationship means being in the same cluster. We consider the
following model for two clusters: we are allowed to query any pair of nodes
whether they belong to the same cluster or not, but the answer to each query is
corrupted with some probability q < 1/2 (the bias 1 - 2q measures how
informative the answers are). We provide an algorithm that recovers all signs
correctly with high probability in the presence of noise, using a number of
queries that is the best known for this problem for all but tiny values of the
bias, improving on the recent work of Mazumdar and Saha (2017). We also provide
a second algorithm that performs more queries and uses breadth-first
search as its main algorithmic primitive. While both the running time and the
number of queries for this algorithm are sub-optimal, our result relies on
novel theoretical techniques, and naturally suggests the use of edge-disjoint
paths as a feature for predicting signs in online social networks.
Correspondingly, we experiment with using edge-disjoint paths of short
length as a feature for predicting the sign of an edge in real-world
signed networks. Empirical findings suggest that the use of such paths improves
the classification accuracy, especially for pairs of nodes with no common
neighbors.
Comment: arXiv admin note: text overlap with arXiv:1609.0075
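The path-based idea above can be sketched as follows: estimate the sign of a pair (u, v) by multiplying observed (noisy) edge signs along several edge-disjoint u-v paths and taking a majority vote over the path products. The instance below is a toy; finding the edge-disjoint paths is assumed done elsewhere.

```python
# Hedged sketch: majority vote over sign products of edge-disjoint paths.

def predict_sign(paths, observed_sign):
    """paths: list of vertex sequences [u, ..., v];
    observed_sign: dict mapping frozenset({a, b}) -> +1 or -1 (noisy)."""
    votes = 0
    for path in paths:
        product = 1
        for a, b in zip(path, path[1:]):
            product *= observed_sign[frozenset((a, b))]
        votes += product
    return +1 if votes >= 0 else -1

# Two of three paths carry the correct (+1) product; one edge is corrupted.
signs = {frozenset(e): s for e, s in [((0, 1), +1), ((1, 4), +1),
                                      ((0, 2), +1), ((2, 4), -1),   # corrupted
                                      ((0, 3), -1), ((3, 4), -1)]}
paths = [[0, 1, 4], [0, 2, 4], [0, 3, 4]]
print(predict_sign(paths, signs))  # two +1 products outvote one -1 -> 1
```

Intuitively, each extra edge-disjoint path is an independent noisy vote, which is why such paths make a natural prediction feature.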
Clustering Semi-Random Mixtures of Gaussians
Gaussian mixture models (GMM) are the most widely used statistical model for
the k-means clustering problem and form a popular framework for clustering in
machine learning and data analysis. In this paper, we propose a natural
semi-random model for k-means clustering that generalizes the Gaussian
mixture model, and that we believe will be useful in identifying robust
algorithms. In our model, a semi-random adversary is allowed to make arbitrary
"monotone" or helpful changes to the data generated from the Gaussian mixture
model.
Our first contribution is a polynomial time algorithm that provably recovers
the ground-truth up to small classification error w.h.p., assuming certain
separation between the components. Perhaps surprisingly, the algorithm we
analyze is the popular Lloyd's algorithm for k-means clustering that is the
method-of-choice in practice. Our second result complements the upper bound by
giving a nearly matching information-theoretic lower bound on the number of
misclassified points incurred by any k-means clustering algorithm on the
semi-random model.
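Since the analyzed algorithm is standard Lloyd's iteration, here is a minimal, dependency-free sketch of it. The naive initialization (first k points) is purely for illustration; a real run would use careful seeding.

```python
# Lloyd's algorithm for k-means: alternate assignment and centroid update.

def lloyd(points, k, iters=20):
    centers = points[:k]                       # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            j = min(range(k), key=lambda i: sum((a - b) ** 2
                    for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]   # update step
    return centers

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(sorted(lloyd(data, 2)))  # centers near (0.05, 0.1) and (5.1, 4.95)
```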
Inference in Sparse Graphs with Pairwise Measurements and Side Information
We consider the statistical problem of recovering a hidden "ground truth"
binary labeling for the vertices of a graph up to low Hamming error from noisy
edge and vertex measurements. We present new algorithms and a sharp
finite-sample analysis for this problem on trees and sparse graphs with poor
expansion properties such as hypergrids and ring lattices. Our method
generalizes and improves over that of Globerson et al. (2015), who introduced
the problem for two-dimensional grid lattices.
For trees we provide a simple, efficient algorithm that infers the ground
truth with optimal Hamming error, has optimal sample complexity, and implies
recovery results for all connected graphs. Here, the presence of side
information is critical to obtain a non-trivial recovery rate. We then show how
to adapt this algorithm to tree decompositions of edge-subgraphs of certain
graph families such as lattices, resulting in optimal recovery error rates that
can be obtained efficiently.
The thrust of our analysis is to 1) use the tree decomposition along with
edge measurements to produce a small class of viable vertex labelings and 2)
apply an analysis influenced by statistical learning theory to show that we can
infer the ground truth from this class using vertex measurements. We show the
power of our method in several examples including hypergrids, ring lattices,
and the Newman-Watts model for small world graphs. For two-dimensional grids,
our results improve over Globerson et al. (2015) by obtaining optimal recovery
in the constant-height regime.
Comment: AISTATS 201
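On trees the two-step idea can be shown in miniature: edge measurements pin down every label relative to the root (one global sign remains free), and the vertex side information resolves that single bit by majority. The tree, observations, and names below are illustrative only, not the paper's algorithm verbatim.

```python
# Simplified sketch: propagate relative labels from edge measurements,
# then fix the global flip using noisy vertex votes (side information).

def infer_tree_labels(parent, edge_obs, vertex_obs):
    """parent[v] is v's parent (root has parent -1); observations are +/-1."""
    n = len(parent)
    rel = [0] * n
    for v in range(n):                 # parents are listed before children
        rel[v] = 1 if parent[v] == -1 else rel[parent[v]] * edge_obs[v]
    # One noisy vertex vote per node decides the single global sign.
    flip = 1 if sum(r * o for r, o in zip(rel, vertex_obs)) >= 0 else -1
    return [flip * r for r in rel]

parent     = [-1, 0, 0, 1]             # a small rooted tree
edge_obs   = [0, +1, -1, +1]           # edge_obs[v] ~ label[v] * label[parent[v]]
vertex_obs = [-1, -1, +1, +1]          # noisy direct votes (one is wrong)
print(infer_tree_labels(parent, edge_obs, vertex_obs))  # [-1, -1, 1, -1]
```

Without the vertex votes, the labeling is only recoverable up to a global sign flip, which is why side information is critical here.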
Learning Communities in the Presence of Errors
We study the problem of learning communities in the presence of modeling
errors and give robust recovery algorithms for the Stochastic Block Model
(SBM). This model, which is also known as the Planted Partition Model, is
widely used for community detection and graph partitioning in various fields,
including machine learning, statistics, and social sciences. Many algorithms
exist for learning communities in the Stochastic Block Model, but they do not
work well in the presence of errors.
In this paper, we initiate the study of robust algorithms for partial
recovery in SBM with modeling errors or noise. We consider graphs generated
according to the Stochastic Block Model and then modified by an adversary. We
allow two types of adversarial errors, Feige-Kilian or monotone errors, and
edge outlier errors. Mossel, Neeman and Sly (STOC 2015) posed an open question
about whether an almost exact recovery is possible when the adversary is
allowed to add edges. Our work answers this question affirmatively, even in the
case of multiple communities.
We then show that our algorithms work not only when the instances come from
SBM, but also work when the instances come from any distribution of graphs that
is close to SBM in the Kullback-Leibler divergence. This result
also works in the presence of adversarial errors. Finally, we present almost
tight lower bounds for two communities.
Comment: 34 pages. Appearing in the Conference on Learning Theory (COLT)'1
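To make the generative model concrete, here is a tiny sampler for the two-community Stochastic Block Model discussed above: intra-community edges appear with probability p, inter-community edges with probability q < p. The parameter names are the conventional ones, not taken from the paper.

```python
# Toy two-community SBM generator.

import random

def sample_sbm(n, p, q, seed=0):
    rng = random.Random(seed)
    community = [v % 2 for v in range(n)]      # alternate the two communities
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            prob = p if community[u] == community[v] else q
            if rng.random() < prob:
                edges.append((u, v))
    return community, edges

community, edges = sample_sbm(40, p=0.8, q=0.1)
intra = sum(community[u] == community[v] for u, v in edges)
print(intra, len(edges) - intra)   # intra edges far outnumber inter edges
```

An adversary of the kind considered in the paper would then add or delete edges of this sample before the recovery algorithm sees it.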
Generalizing the Hypergraph Laplacian via a Diffusion Process with Mediators
In a recent breakthrough STOC 2015 paper, a continuous diffusion process was
considered on hypergraphs (which has been refined in a recent JACM 2018 paper)
to define a Laplacian operator, whose spectral properties satisfy the
celebrated Cheeger's inequality. However, one peculiar aspect of this diffusion
process is that each hyperedge directs flow only from vertices with the maximum
density to those with the minimum density, while ignoring vertices having
strict in-between densities. In this work, we consider a generalized diffusion
process, in which vertices in a hyperedge can act as mediators to receive flow
from vertices with maximum density and deliver flow to those with minimum
density. We show that the resulting Laplacian operator still has a second
eigenvalue satisfying Cheeger's inequality. Our generalized diffusion model
shows that there is a family of operators whose spectral properties are related
to hypergraph conductance, and provides a powerful tool to enhance the
development of spectral hypergraph theory. Moreover, since every vertex can
participate in the new diffusion model at every instant, this can potentially
have wider practical applications.
Comment: arXiv admin note: text overlap with arXiv:1605.0148
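An illustrative discrete analogue of the basic (unmediated) diffusion described above: in each step, every hyperedge moves a small amount of mass from its currently densest vertex to its least dense one. The paper's mediated process also lets intermediate-density vertices relay flow; this sketch keeps only the max-to-min rule for brevity, and all names are illustrative.

```python
# Discrete toy version of hypergraph max-to-min density diffusion.

def diffuse(density, hyperedges, step=0.1, iters=50):
    x = list(density)
    for _ in range(iters):
        for e in hyperedges:
            hi = max(e, key=lambda v: x[v])    # vertex with maximum density
            lo = min(e, key=lambda v: x[v])    # vertex with minimum density
            delta = step * (x[hi] - x[lo])
            x[hi] -= delta
            x[lo] += delta                     # total mass is conserved
    return x

x = diffuse([1.0, 0.0, 0.0, 0.0], [(0, 1, 2), (2, 3)])
print([round(v, 2) for v in x])   # densities flatten toward the average 0.25
```

Note that in a 3-vertex hyperedge the middle-density vertex is skipped at each step, which is exactly the peculiarity that mediators address.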
Clustering Via Crowdsourcing
In recent years, crowdsourcing, a.k.a. human-aided computation, has emerged as
an effective platform for solving problems that are considered complex for
machines alone. Using humans is time-consuming and costly due to monetary
compensation. Therefore, a crowd-based algorithm must judiciously use any
information computed through an automated process, and ask a minimum number of
questions to the crowd adaptively.
One such problem which has received significant attention is entity
resolution. Formally, we are given a graph with an unknown edge set under which
the vertex set is a union of an unknown (but typically large) number of
disjoint cliques. The goal is to retrieve these cliques by making a minimum
number of pair-wise queries to an oracle (the crowd). When the answer to each
query is correct, e.g. via resampling, then this reduces to finding connected
components in a graph. On the other hand, when crowd answers may be incorrect,
it corresponds to clustering over a minimum number of noisy inputs. Even with
perfect answers, simple matching lower and upper bounds on the query complexity
can be shown. A major contribution of this paper is to reduce the query
complexity to linear, or even sublinear in the number of vertices, when mild
side information is provided by a machine, and even in the presence of crowd
errors which are not correctable via resampling. We develop new information
theoretic lower bounds
on the query complexity of clustering with side information and errors, and our
upper bounds closely match with them. Our algorithms are naturally
parallelizable, and also give near-optimal bounds on the number of adaptive
rounds required to match the query complexity.
Comment: 36 pages
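The reduction mentioned above, from clustering with perfect answers to connected components, can be sketched with a union-find over the pairs answered "same cluster". The item set and pair list are illustrative.

```python
# Connected components over "yes" answers via union-find (path halving).

def components(items, same_pairs):
    parent = {x: x for x in items}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x
    for a, b in same_pairs:                    # each "same cluster" answer merges
        parent[find(a)] = find(b)
    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())

print(components("abcde", [("a", "b"), ("b", "c"), ("d", "e")]))
# [['a', 'b', 'c'], ['d', 'e']]
```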
Towards Learning Sparsely Used Dictionaries with Arbitrary Supports
Dictionary learning is a popular approach for inferring a hidden basis or
dictionary in which data has a sparse representation. Data generated from the
dictionary A (an n by m matrix, with m > n in the over-complete setting) is
given by Y = AX where X is a matrix whose columns have supports chosen from a
distribution over k-sparse vectors, and the non-zero values chosen from a
symmetric distribution. Given Y, the goal is to recover A and X in polynomial
time. Existing algorithms give polytime guarantees for recovering incoherent
dictionaries, under strong distributional assumptions both on the supports of
the columns of X, and on the values of the non-zero entries. In this work, we
study the following question: Can we design efficient algorithms for recovering
dictionaries when the supports of the columns of X are arbitrary?
To address this question while circumventing the issue of
non-identifiability, we study a natural semirandom model for dictionary
learning where there are a large number of samples with arbitrary
k-sparse supports for x, along with a few samples where the sparse supports are
chosen uniformly at random. While the few samples with random supports ensure
identifiability, the support distribution can look almost arbitrary in
aggregate. Hence existing algorithmic techniques seem to break down as they
make strong assumptions on the supports.
Our main contribution is a new polynomial time algorithm for learning
incoherent over-complete dictionaries that works under the semirandom model.
Additionally the same algorithm provides polynomial time guarantees in new
parameter regimes when the supports are fully random. Finally using these
techniques, we also identify a minimal set of conditions on the supports under
which the dictionary can be (information theoretically) recovered from
polynomial samples for almost linear sparsity.
Comment: 72 pages, fixed minor typos, and added a new reference in related
work
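The generative model Y = AX can be made concrete with a toy sampler for a single k-sparse column. Supports are drawn uniformly here purely for illustration; the semirandom model above allows arbitrary supports for most samples.

```python
# Toy sampler for one column of Y = A X with a k-sparse x whose nonzero
# values come from a symmetric distribution (here, uniform over {-1, +1}).

import random

def sample_column(A, k, rng):
    n, m = len(A), len(A[0])
    support = rng.sample(range(m), k)              # k-sparse support
    x = [0.0] * m
    for j in support:
        x[j] = rng.choice([-1.0, 1.0])             # symmetric nonzero values
    y = [sum(A[i][j] * x[j] for j in support) for i in range(n)]
    return y, x

rng = random.Random(0)
A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]             # n=2, m=3 toy dictionary
y, x = sample_column(A, k=1, rng=rng)
print(sum(v != 0 for v in x))                      # exactly k = 1 nonzero entry
```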
Non-Convex Matrix Completion Against a Semi-Random Adversary
Matrix completion is a well-studied problem with many machine learning
applications. In practice, the problem is often solved by non-convex
optimization algorithms. However, the current theoretical analysis for
non-convex algorithms relies heavily on the assumption that every entry is
observed with exactly the same probability p, which is not realistic in
practice.
In this paper, we investigate a more realistic semi-random model, where the
probability of observing each entry is at least p. Even with this mild
semi-random perturbation, we can construct counter-examples where existing
non-convex algorithms get stuck in bad local optima.
In light of the negative results, we propose a pre-processing step that tries
to re-weight the semi-random input, so that it becomes "similar" to a random
input. We give a nearly-linear time algorithm for this problem, and show that
after our pre-processing, all the local minima of the non-convex objective can
be used to approximately recover the underlying ground-truth matrix.
Comment: added references and fixed typos
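The re-weighting idea can be illustrated with inverse-propensity weighting: if entry (i, j) was observed with known probability p_ij >= p, scaling it by p / p_ij makes every entry contribute as if observed uniformly. The paper's algorithm computes suitable weights without knowing the p_ij; this toy assumes they are given, and all names are illustrative.

```python
# Inverse-propensity re-weighting of observed matrix entries.

def reweight(observed, probs, p):
    """observed: dict (i, j) -> value; probs: dict (i, j) -> observation prob."""
    return {ij: (p / probs[ij]) * val for ij, val in observed.items()}

obs   = {(0, 0): 2.0, (0, 1): 4.0}
probs = {(0, 0): 0.5, (0, 1): 0.25}   # the first entry was oversampled (0.5 > p)
print(reweight(obs, probs, p=0.25))   # {(0, 0): 1.0, (0, 1): 4.0}
```

Down-weighting oversampled entries is what makes the semi-random input "similar" to a uniformly random one.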