
    Correlation Clustering with Noisy Partial Information

    In this paper, we propose and study a semi-random model for the Correlation Clustering problem on arbitrary graphs $G$. We give two approximation algorithms for Correlation Clustering instances from this model. The first algorithm finds a solution of value $(1+\delta)\,\mathrm{optcost} + O_{\delta}(n\log^3 n)$ with high probability, where $\mathrm{optcost}$ is the value of the optimal solution (for every $\delta > 0$). The second algorithm finds the ground truth clustering with an arbitrarily small classification error $\eta$ (under some additional assumptions on the instance). Comment: To appear at Conference on Learning Theory (COLT) 2015. Substantial changes from previous version, including a new section on recovery of the ground truth clustering. 20 pages.
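
    For concreteness, here is a minimal sketch of the objective these algorithms approximate: counting the edges of a signed graph that disagree with a candidate clustering. The dictionary encoding and the function name disagreement_cost are illustrative choices, not taken from the paper.

        # Correlation clustering (MinDisAgree) objective on a signed graph.
        # A '+' edge wants its endpoints together; a '-' edge wants them apart.

        def disagreement_cost(signed_edges, clustering):
            """Count edges violated by the clustering: '+' edges across
            clusters and '-' edges inside a cluster."""
            cost = 0
            for (u, v), sign in signed_edges.items():
                same = clustering[u] == clustering[v]
                if (sign == +1 and not same) or (sign == -1 and same):
                    cost += 1
            return cost

        edges = {(0, 1): +1, (1, 2): +1, (0, 2): -1, (2, 3): -1}
        # The '-' edge (0, 2) sits inside cluster 0, so the cost is 1.
        print(disagreement_cost(edges, {0: 0, 1: 0, 2: 0, 3: 1}))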

    Approximate Correlation Clustering Using Same-Cluster Queries

    Ashtiani et al. (NIPS 2016) introduced a semi-supervised framework for clustering (SSAC) where a learner is allowed to make same-cluster queries. More specifically, in their model there is a query oracle that answers queries of the form "given any two vertices, do they belong to the same optimal cluster?". Ashtiani et al. showed the usefulness of such a query framework by giving a polynomial time algorithm for the k-means clustering problem where the input dataset satisfies some separation condition. Ailon et al. extended the above work to the approximation setting by giving an efficient $(1+\eps)$-approximation algorithm for k-means for any small $\eps > 0$ and any dataset within the SSAC framework. In this work, we extend this line of study to the correlation clustering problem. Correlation clustering is a graph clustering problem where pairwise similarity (or dissimilarity) information is given for every pair of vertices, and the objective is to partition the vertices into clusters that minimise disagreement (or maximise agreement) with the pairwise information given as input. These problems are popularly known as MinDisAgree and MaxAgree, and MinDisAgree[k] and MaxAgree[k] are the versions of these problems where the number of optimal clusters is at most k. There exist Polynomial Time Approximation Schemes (PTAS) for MinDisAgree[k] and MaxAgree[k] where the approximation guarantee is $(1+\eps)$ for any small $\eps$ and the running time is polynomial in the input parameters but exponential in k and $1/\eps$. We obtain a $(1+\eps)$-approximation algorithm for any small $\eps$ with running time that is polynomial in the input parameters and also in k and $1/\eps$. We also give non-trivial upper and lower bounds on the number of same-cluster queries, the lower bound being based on the Exponential Time Hypothesis (ETH). Comment: To appear in LATIN 2018.
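
    As a minimal illustration of the same-cluster-query primitive (the generic SSAC mechanism, not this paper's algorithm), the sketch below clusters vertices by keeping one representative per discovered cluster and querying each new vertex against those representatives; with k clusters this uses O(nk) queries. All names here are hypothetical.

        # Clustering with a perfect same-cluster oracle.

        def cluster_with_oracle(vertices, same_cluster):
            reps, clusters = [], []
            for v in vertices:
                for i, r in enumerate(reps):
                    if same_cluster(v, r):       # one query per representative
                        clusters[i].append(v)
                        break
                else:                            # no match: open a new cluster
                    reps.append(v)
                    clusters.append([v])
            return clusters

        truth = {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'a'}
        oracle = lambda u, v: truth[u] == truth[v]
        print(cluster_with_oracle(range(5), oracle))  # [[0, 1, 4], [2, 3]]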

    Predicting Positive and Negative Links with Noisy Queries: Theory & Practice

    Social networks involve both positive and negative relationships, which can be captured in signed graphs. The {\em edge sign prediction problem} aims to predict whether an interaction between a pair of nodes will be positive or negative. We provide theoretical results for this problem that motivate natural improvements to recent heuristics. The edge sign prediction problem is related to correlation clustering; a positive relationship means being in the same cluster. We consider the following model for two clusters: we are allowed to query any pair of nodes whether they belong to the same cluster or not, but the answer to the query is corrupted with some probability $0 < q < \frac{1}{2}$. Let $\delta = 1 - 2q$ be the bias. We provide an algorithm that recovers all signs correctly with high probability in the presence of noise with $O(\frac{n\log n}{\delta^2} + \frac{\log^2 n}{\delta^6})$ queries. This is the best known result for this problem for all but tiny $\delta$, improving on the recent work of Mazumdar and Saha \cite{mazumdar2017clustering}. We also provide an algorithm that performs $O(\frac{n\log n}{\delta^4})$ queries and uses breadth-first search as its main algorithmic primitive. While both the running time and the number of queries for this algorithm are sub-optimal, our result relies on novel theoretical techniques and naturally suggests the use of edge-disjoint paths as a feature for predicting signs in online social networks. Correspondingly, we experiment with using short edge-disjoint $s$-$t$ paths as a feature for predicting the sign of the edge $(s,t)$ in real-world signed networks. Empirical findings suggest that the use of such paths improves classification accuracy, especially for pairs of nodes with no common neighbors. Comment: arXiv admin note: text overlap with arXiv:1609.0075
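
    The paths feature the paper motivates can be sketched as follows: estimate the sign of a pair (s, t) by multiplying noisy signs along several short edge-disjoint s-t paths and taking a majority vote. Each short path is correct with probability bounded away from 1/2, so independent paths amplify the signal. The toy oracle and the explicit path list are assumptions for illustration.

        import random

        def predict_sign(paths, noisy_sign):
            votes = 0
            for path in paths:                       # each path runs from s to t
                prod = 1
                for u, v in zip(path, path[1:]):
                    prod *= noisy_sign(u, v)         # sign of the pair, possibly flipped
                votes += prod
            return +1 if votes >= 0 else -1

        q = 0.1                                      # flip probability, 0 < q < 1/2
        true_sign = lambda u, v: +1                  # toy ground truth: one cluster
        noisy = lambda u, v: true_sign(u, v) * (-1 if random.random() < q else 1)
        paths = [[0, i, 1] for i in range(2, 9)]     # 7 edge-disjoint paths from 0 to 1
        print(predict_sign(paths, noisy))            # +1 with high probability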

    Clustering Semi-Random Mixtures of Gaussians

    Gaussian mixture models (GMM) are the most widely used statistical model for the $k$-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural semi-random model for $k$-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. In our model, a semi-random adversary is allowed to make arbitrary "monotone" or helpful changes to the data generated from the Gaussian mixture model. Our first contribution is a polynomial time algorithm that provably recovers the ground truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd's algorithm for $k$-means clustering that is the method of choice in practice. Our second result complements the upper bound by giving a nearly matching information-theoretic lower bound on the number of misclassified points incurred by any $k$-means clustering algorithm on the semi-random model.
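
    Since the algorithm analyzed is Lloyd's algorithm itself, a bare-bones version is easy to state; the sketch below omits the paper's initialization and separation conditions and uses toy data.

        import numpy as np

        def lloyd(points, centers, iters=50):
            """Plain Lloyd's iterations: assign each point to its nearest
            center, then move each center to the mean of its cluster."""
            for _ in range(iters):
                dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
                labels = dists.argmin(axis=1)
                for j in range(len(centers)):
                    if (labels == j).any():
                        centers[j] = points[labels == j].mean(axis=0)
            return labels, centers

        rng = np.random.default_rng(0)
        pts = np.vstack([rng.normal(0, 1, (100, 2)),    # component 1
                         rng.normal(8, 1, (100, 2))])   # component 2, well separated
        init = pts[rng.choice(len(pts), 2, replace=False)].copy()
        labels, centers = lloyd(pts, init)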

    Inference in Sparse Graphs with Pairwise Measurements and Side Information

    We consider the statistical problem of recovering a hidden "ground truth" binary labeling for the vertices of a graph up to low Hamming error from noisy edge and vertex measurements. We present new algorithms and a sharp finite-sample analysis for this problem on trees and sparse graphs with poor expansion properties, such as hypergrids and ring lattices. Our method generalizes and improves over that of Globerson et al. (2015), who introduced the problem for two-dimensional grid lattices. For trees we provide a simple, efficient algorithm that infers the ground truth with optimal Hamming error, has optimal sample complexity, and implies recovery results for all connected graphs. Here, the presence of side information is critical to obtain a non-trivial recovery rate. We then show how to adapt this algorithm to tree decompositions of edge-subgraphs of certain graph families such as lattices, resulting in optimal recovery error rates that can be obtained efficiently. The thrust of our analysis is to 1) use the tree decomposition along with edge measurements to produce a small class of viable vertex labelings and 2) apply an analysis influenced by statistical learning theory to show that we can infer the ground truth from this class using vertex measurements. We show the power of our method in several examples including hypergrids, ring lattices, and the Newman-Watts model for small-world graphs. For two-dimensional grids, our results improve over Globerson et al. (2015) by obtaining optimal recovery in the constant-height regime. Comment: AISTATS 2018.
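
    For the tree case the abstract describes, edge measurements determine the labeling up to one global sign flip, and the vertex measurements (the side information) resolve that flip. A hedged sketch under noiseless measurements, with all names hypothetical:

        def infer_tree(n, tree_edges, edge_meas, vertex_meas):
            """edge_meas[(u, v)] is the product of the endpoint labels;
            vertex_meas[u] is a (here noiseless) label measurement."""
            adj = {u: [] for u in range(n)}
            for u, v in tree_edges:
                adj[u].append(v); adj[v].append(u)
            label, stack = {0: +1}, [0]          # propagate relative labels from vertex 0
            while stack:
                u = stack.pop()
                for v in adj[u]:
                    if v not in label:
                        m = edge_meas.get((u, v), edge_meas.get((v, u)))
                        label[v] = label[u] * m
                        stack.append(v)
            agree = sum(label[u] == vertex_meas[u] for u in range(n))
            flip = +1 if agree >= n - agree else -1   # side information picks the flip
            return [flip * label[u] for u in range(n)]

        edges = [(0, 1), (1, 2), (1, 3)]
        truth = [+1, -1, -1, +1]
        e_meas = {(u, v): truth[u] * truth[v] for u, v in edges}
        print(infer_tree(4, edges, e_meas, truth))    # recovers truth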

    Learning Communities in the Presence of Errors

    We study the problem of learning communities in the presence of modeling errors and give robust recovery algorithms for the Stochastic Block Model (SBM). This model, which is also known as the Planted Partition Model, is widely used for community detection and graph partitioning in various fields, including machine learning, statistics, and social sciences. Many algorithms exist for learning communities in the Stochastic Block Model, but they do not work well in the presence of errors. In this paper, we initiate the study of robust algorithms for partial recovery in SBM with modeling errors or noise. We consider graphs generated according to the Stochastic Block Model and then modified by an adversary. We allow two types of adversarial errors: Feige-Kilian or monotone errors, and edge outlier errors. Mossel, Neeman and Sly (STOC 2015) posed an open question about whether almost exact recovery is possible when the adversary is allowed to add $o(n)$ edges. Our work answers this question affirmatively even in the case of $k > 2$ communities. We then show that our algorithms work not only when the instances come from SBM, but also when the instances come from any distribution of graphs that is $\epsilon m$-close to SBM in the Kullback-Leibler divergence. This result also holds in the presence of adversarial errors. Finally, we present almost tight lower bounds for two communities. Comment: 34 pages. Appearing in the Conference on Learning Theory (COLT) 2016.
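
    The error model (not the recovery algorithm) is easy to simulate; the sketch below samples a two-community SBM and then lets an "outlier" adversary insert a small number of arbitrary cross-community edges. Parameter names are illustrative.

        import random

        def sbm_with_outliers(n, p_in, p_out, num_outlier_edges, seed=0):
            rng = random.Random(seed)
            side = [i % 2 for i in range(n)]          # planted two-community partition
            edges = set()
            for u in range(n):
                for v in range(u + 1, n):
                    prob = p_in if side[u] == side[v] else p_out
                    if rng.random() < prob:
                        edges.add((u, v))
            added = 0                                  # adversary: o(n) extra cross edges
            while added < num_outlier_edges:
                u, v = rng.randrange(n), rng.randrange(n)
                e = (min(u, v), max(u, v))
                if u != v and side[u] != side[v] and e not in edges:
                    edges.add(e)
                    added += 1
            return edges, side

        edges, side = sbm_with_outliers(200, 0.5, 0.05, num_outlier_edges=20)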

    Generalizing the Hypergraph Laplacian via a Diffusion Process with Mediators

    In a recent breakthrough STOC 2015 paper, a continuous diffusion process was considered on hypergraphs (and refined in a recent JACM 2018 paper) to define a Laplacian operator whose spectral properties satisfy the celebrated Cheeger inequality. However, one peculiar aspect of this diffusion process is that each hyperedge directs flow only from vertices with the maximum density to those with the minimum density, while ignoring vertices with strictly in-between densities. In this work, we consider a generalized diffusion process in which vertices in a hyperedge can act as mediators, receiving flow from vertices with maximum density and delivering flow to those with minimum density. We show that the resulting Laplacian operator still has a second eigenvalue satisfying the Cheeger inequality. Our generalized diffusion model shows that there is a family of operators whose spectral properties are related to hypergraph conductance, and it provides a powerful tool to enhance the development of spectral hypergraph theory. Moreover, since every vertex can participate in the new diffusion model at every instant, this can potentially have wider practical applications. Comment: arXiv admin note: text overlap with arXiv:1605.0148
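
    A toy Euler discretization of the basic (mediator-free) diffusion the paper starts from: each hyperedge pushes flow from its current maximum-density vertex toward its minimum-density vertex; the mediator variant would additionally let intermediate vertices relay flow. Step size and data are arbitrary choices here.

        import numpy as np

        def diffusion_step(x, hyperedges, dt=0.1):
            """One discrete step: every hyperedge moves dt * (max - min)
            units of mass from its densest vertex to its sparsest."""
            dx = np.zeros_like(x)
            for e in hyperedges:
                e = list(e)
                hi = max(e, key=lambda v: x[v])
                lo = min(e, key=lambda v: x[v])
                gap = x[hi] - x[lo]
                dx[hi] -= dt * gap
                dx[lo] += dt * gap
            return x + dx

        x = np.array([1.0, 0.5, 0.0, -1.0])
        for _ in range(200):
            x = diffusion_step(x, [{0, 1, 2}, {1, 2, 3}])
        print(x)   # densities flatten across the connected hyperedges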

    Clustering Via Crowdsourcing

    In recent years, crowdsourcing, a.k.a. human-aided computation, has emerged as an effective platform for solving problems that are considered complex for machines alone. Using humans is time-consuming and costly due to monetary compensation. Therefore, a crowd-based algorithm must judiciously use any information computed through an automated process, and adaptively ask a minimum number of questions to the crowd. One such problem which has received significant attention is {\em entity resolution}. Formally, we are given a graph $G=(V,E)$ with unknown edge set $E$, where $G$ is a union of $k$ (again unknown, but typically large, $O(n^\alpha)$ for $\alpha > 0$) disjoint cliques $G_i = (V_i, E_i)$, $i = 1, \dots, k$. The goal is to retrieve the sets $V_i$ by making a minimum number of pairwise queries $V \times V \to \{\pm 1\}$ to an oracle (the crowd). When the answer to each query is correct, e.g. via resampling, this reduces to finding connected components in a graph. On the other hand, when crowd answers may be incorrect, it corresponds to clustering over a minimum number of noisy inputs. Even with perfect answers, simple matching lower and upper bounds of $\Theta(nk)$ on the query complexity can be shown. A major contribution of this paper is to reduce the query complexity to linear or even sublinear in $n$ when mild side information is provided by a machine, even in the presence of crowd errors which are not correctable via resampling. We develop new information-theoretic lower bounds on the query complexity of clustering with side information and errors, and our upper bounds closely match them. Our algorithms are naturally parallelizable, and also give near-optimal bounds on the number of adaptive rounds required to match the query complexity. Comment: 36 pages.
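
    To illustrate the theme of the query-complexity reduction (a hedged toy, not the paper's algorithm), the sketch below spends crowd queries only on pairs the machine's similarity score leaves ambiguous; with clean side information it places every vertex with zero crowd calls. All names and thresholds are illustrative.

        def cluster_with_side_info(vertices, machine_sim, crowd_query, lo=0.2, hi=0.8):
            reps, clusters, crowd_calls = [], [], 0
            for v in vertices:
                placed = False
                for i, r in enumerate(reps):
                    s = machine_sim(v, r)
                    if s >= hi:
                        same = True                  # machine is confident: same cluster
                    elif s <= lo:
                        continue                     # machine is confident: different
                    else:
                        same = crowd_query(v, r)     # ambiguous pair: ask the crowd
                        crowd_calls += 1
                    if same:
                        clusters[i].append(v); placed = True
                        break
                if not placed:
                    reps.append(v); clusters.append([v])
            return clusters, crowd_calls

        truth = {v: v % 3 for v in range(12)}
        sim = lambda u, v: 0.9 if truth[u] == truth[v] else 0.1
        oracle = lambda u, v: truth[u] == truth[v]
        print(cluster_with_side_info(range(12), sim, oracle))   # 3 clusters, 0 crowd calls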

    Towards Learning Sparsely Used Dictionaries with Arbitrary Supports

    Dictionary learning is a popular approach for inferring a hidden basis or dictionary in which data has a sparse representation. Data generated from the dictionary A (an n-by-m matrix, with m > n in the over-complete setting) is given by Y = AX, where X is a matrix whose columns have supports chosen from a distribution over k-sparse vectors and whose non-zero values are chosen from a symmetric distribution. Given Y, the goal is to recover A and X in polynomial time. Existing algorithms give polytime guarantees for recovering incoherent dictionaries under strong distributional assumptions both on the supports of the columns of X and on the values of the non-zero entries. In this work, we study the following question: can we design efficient algorithms for recovering dictionaries when the supports of the columns of X are arbitrary? To address this question while circumventing the issue of non-identifiability, we study a natural semirandom model for dictionary learning where there are a large number of samples $y = Ax$ with arbitrary k-sparse supports for x, along with a few samples where the sparse supports are chosen uniformly at random. While the few samples with random supports ensure identifiability, the support distribution can look almost arbitrary in aggregate. Hence existing algorithmic techniques seem to break down, as they make strong assumptions on the supports. Our main contribution is a new polynomial time algorithm for learning incoherent over-complete dictionaries that works under the semirandom model. Additionally, the same algorithm provides polynomial time guarantees in new parameter regimes when the supports are fully random. Finally, using these techniques, we also identify a minimal set of conditions on the supports under which the dictionary can be (information-theoretically) recovered from polynomially many samples for almost linear sparsity, i.e., $k = \tilde{O}(n)$. Comment: 72 pages, fixed minor typos, and added a new reference in related work.
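
    The generative model Y = AX is straightforward to sample; the sketch below uses fully random supports (the semirandom model would instead mix mostly arbitrary supports with a few uniformly random ones). Function and parameter names are illustrative.

        import numpy as np

        def sample_dictionary_data(n, m, k, num_samples, seed=0):
            rng = np.random.default_rng(seed)
            A = rng.normal(size=(n, m))
            A /= np.linalg.norm(A, axis=0)           # unit-norm atoms (columns)
            X = np.zeros((m, num_samples))
            for j in range(num_samples):
                support = rng.choice(m, size=k, replace=False)   # k-sparse support
                X[support, j] = rng.choice([-1.0, 1.0], size=k)  # symmetric values
            return A @ X, A, X

        Y, A, X = sample_dictionary_data(n=20, m=40, k=3, num_samples=500)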

    Non-Convex Matrix Completion Against a Semi-Random Adversary

    Matrix completion is a well-studied problem with many machine learning applications. In practice, the problem is often solved by non-convex optimization algorithms. However, the current theoretical analysis for non-convex algorithms relies heavily on the assumption that every entry is observed with exactly the same probability $p$, which is not realistic in practice. In this paper, we investigate a more realistic semi-random model, where the probability of observing each entry is at least $p$. Even with this mild semi-random perturbation, we can construct counter-examples where existing non-convex algorithms get stuck in bad local optima. In light of the negative results, we propose a pre-processing step that tries to re-weight the semi-random input so that it becomes "similar" to a random input. We give a nearly-linear time algorithm for this problem, and show that after our pre-processing, all the local minima of the non-convex objective can be used to approximately recover the underlying ground-truth matrix. Comment: added references and fixed typos.
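
    The semi-random observation model is simple to simulate: each entry is observed with probability at least p, and the adversary may raise individual probabilities, e.g. by over-observing one block, which is what breaks uniform-sampling analyses. A hedged sketch with illustrative names:

        import numpy as np

        def semi_random_mask(shape, p, seed=0):
            """Observation mask where every entry has probability >= p;
            the adversary boosts one block's observation probability."""
            rng = np.random.default_rng(seed)
            probs = np.full(shape, p)
            probs[: shape[0] // 2, : shape[1] // 2] = min(1.0, 5 * p)
            return rng.random(shape) < probs

        M_star = np.outer(np.arange(1, 9), np.arange(1, 7)).astype(float)  # rank-1 truth
        mask = semi_random_mask(M_star.shape, p=0.2)
        observed = np.where(mask, M_star, np.nan)   # unobserved entries as NaN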