21 research outputs found

    Correlation Clustering with Adaptive Similarity Queries

    In correlation clustering, we are given n objects together with a binary similarity score between each pair of them. The goal is to partition the objects into clusters so as to minimise the disagreements with the scores. In this work we investigate correlation clustering as an active learning problem: each similarity score can be learned by making a query, and the goal is to minimise both the disagreements and the total number of queries. On the one hand, we describe simple active learning algorithms, which provably achieve an almost optimal trade-off while giving cluster recovery guarantees, and we test them on different datasets. On the other hand, we prove information-theoretic bounds on the number of queries necessary to guarantee a prescribed disagreement bound. These results give a rich characterization of the trade-off between queries and clustering error.
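    To make the objective above concrete, here is a minimal sketch (function and variable names are illustrative, not from the paper) that counts the disagreements of a candidate partition against the given binary pairwise scores:

```python
from itertools import combinations

def disagreements(labels, similar):
    """Count correlation-clustering disagreements.

    labels:  dict mapping object -> cluster id
    similar: dict mapping frozenset({u, v}) -> 1 (similar) or 0 (dissimilar)
    """
    cost = 0
    for u, v in combinations(labels, 2):
        same_cluster = labels[u] == labels[v]
        is_similar = similar[frozenset((u, v))] == 1
        # A disagreement: a similar pair split apart, or a dissimilar pair merged.
        if same_cluster != is_similar:
            cost += 1
    return cost

# Three objects; a and b are similar, c is dissimilar to both.
sim = {frozenset(p): s for p, s in [(("a", "b"), 1), (("a", "c"), 0), (("b", "c"), 0)]}
print(disagreements({"a": 0, "b": 0, "c": 1}, sim))  # perfect clustering -> 0
print(disagreements({"a": 0, "b": 1, "c": 1}, sim))  # splits a,b and merges b,c -> 2
```

    In the active-learning setting studied by the paper, each lookup of `similar` would instead be a paid query, and the algorithm chooses which pairs to query.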

    One Partition Approximating All ℓ_p-norm Objectives in Correlation Clustering

    This paper considers correlation clustering on unweighted complete graphs. We give a combinatorial algorithm that returns a single clustering solution that is simultaneously O(1)-approximate for all ℓ_p-norms of the disagreement vector. This proves that minimal sacrifice is needed in order to optimize different norms of the disagreement vector. Our algorithm is the first combinatorial approximation algorithm for the ℓ_2-norm objective, and more generally the first combinatorial algorithm for the ℓ_p-norm objective when 2 ≤ p < ∞. It is also faster than all previous algorithms that minimize the ℓ_p-norm of the disagreement vector, with run-time O(n^ω), where O(n^ω) is the time for matrix multiplication on n × n matrices. When the maximum positive degree in the graph is at most Δ, this can be improved to a run-time of O(n Δ^2 log n). Comment: 27 pages, 2 figures
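    The disagreement vector assigns to each vertex the number of edges it disagrees on; the ℓ_1-norm is the total error, while the ℓ_∞-norm is the worst single vertex's error. A small sketch of these quantities (a toy computation, not the paper's algorithm):

```python
from itertools import combinations

def disagreement_vector(labels, positive_edges):
    """Per-vertex disagreements on a complete (+/-)-labeled graph.

    labels:         dict mapping vertex -> cluster id
    positive_edges: set of frozenset({u, v}) carrying a (+) label;
                    every other pair is implicitly (-).
    """
    vec = {v: 0 for v in labels}
    for u, v in combinations(labels, 2):
        same = labels[u] == labels[v]
        pos = frozenset((u, v)) in positive_edges
        if same != pos:      # a mistake is charged to both endpoints
            vec[u] += 1
            vec[v] += 1
    return vec

def lp_norm(vec, p):
    vals = list(vec.values())
    if p == float("inf"):
        return max(vals)
    return sum(x ** p for x in vals) ** (1 / p)

# Cluster {a,b,c} together: the missing (+) edge a-c costs a and c one each.
vec = disagreement_vector({"a": 0, "b": 0, "c": 0, "d": 1},
                          {frozenset(("a", "b")), frozenset(("b", "c"))})
```

    The paper's result says one clustering can be close to optimal for every such norm at once, so these objectives need not be traded off against each other.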

    Single-Pass Pivot Algorithm for Correlation Clustering. Keep it simple!

    We show that a simple single-pass semi-streaming variant of the Pivot algorithm for Correlation Clustering gives a (3 + Δ)-approximation using O(n/Δ) words of memory. This is a slight improvement over the recent results of Cambus, Kuhn, Lindy, Pai, and Uitto, who gave a (3 + Δ)-approximation using O(n log n) words of memory, and Behnezhad, Charikar, Ma, and Tan, who gave a 5-approximation using O(n) words of memory. One of the main contributions of this paper is that both the algorithm and its analysis are very simple, and the algorithm is easy to implement.
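    The classical offline Pivot algorithm referenced above is simple enough to state in a few lines. A sketch of that baseline version (a 3-approximation in expectation; this is not the paper's semi-streaming variant):

```python
import random

def pivot(vertices, positive_edges, seed=None):
    """Classical Pivot for correlation clustering (Ailon-Charikar-Newman style).

    Repeatedly pick a random unclustered pivot and cluster it together with
    all of its still-unclustered (+) neighbours.
    """
    rng = random.Random(seed)
    unclustered = set(vertices)
    clusters = []
    while unclustered:
        piv = rng.choice(sorted(unclustered))  # sorted only for reproducibility
        cluster = {piv} | {u for u in unclustered
                           if frozenset((piv, u)) in positive_edges}
        clusters.append(cluster)
        unclustered -= cluster
    return clusters

clusters = pivot(["a", "b", "c"], {frozenset(("a", "b"))}, seed=1)
```

    The streaming challenge the paper addresses is running this idea in a single pass over the edges with only O(n/Δ) words of memory, rather than with random access to the whole graph.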

    Efficient Correlation Clustering Methods for Large Consensus Clustering Instances

    Consensus clustering (or clustering aggregation) inputs k partitions of a given ground set V, and seeks to create a single partition that minimizes disagreement with all input partitions. State-of-the-art algorithms for consensus clustering are based on correlation clustering methods like the popular Pivot algorithm. Unfortunately these methods have not proved to be practical for consensus clustering instances where either k or |V| gets large. In this paper we provide practical run time improvements for correlation clustering solvers when |V| is large. We reduce the time complexity of Pivot from O(|V|^2 k) to O(|V| k), and its space complexity from O(|V|^2) to O(|V| k) -- a significant savings since in practice k is much less than |V|. We also analyze a sampling method for these algorithms when k is large, bridging the gap between running Pivot on the full set of input partitions (an expected 1.57-approximation) and choosing a single input partition at random (an expected 2-approximation). We show experimentally that algorithms like Pivot do obtain quality clustering results in practice even on small samples of input partitions.
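    The key to the O(|V| k) bound is that a pair's similarity can be checked lazily from the k input labels in O(k) time, so Pivot never needs to materialize the O(|V|^2) correlation graph. A sketch of that idea (the majority-vote similarity rule here is an illustrative assumption, not necessarily the paper's exact construction):

```python
import random

def consensus_pivot(partitions, seed=None):
    """Pivot for consensus clustering without building the |V|^2 graph.

    partitions: list of k dicts, each mapping element -> block label.
    A pair counts as similar (+) when more than half of the input
    partitions put the two elements in the same block; this test is
    evaluated lazily in O(k) per pair, so each pivot round costs O(|V| k).
    """
    rng = random.Random(seed)
    elements = set(partitions[0])
    half = len(partitions) / 2

    def similar(u, v):
        return sum(1 for p in partitions if p[u] == p[v]) > half

    clusters = []
    while elements:
        piv = rng.choice(sorted(elements))
        cluster = {piv} | {u for u in elements if u != piv and similar(piv, u)}
        clusters.append(cluster)
        elements -= cluster
    return clusters

# Three identical input partitions: {1,2} together, {3,4} together.
parts = [{1: "a", 2: "a", 3: "b", 4: "b"} for _ in range(3)]
result = consensus_pivot(parts, seed=0)
```

    Sampling, as analyzed in the paper, would additionally evaluate `similar` on a random subset of the k partitions rather than all of them.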

    Advances in correlation clustering

    The task of clustering is to partition a given dataset in such a way that objects within a cluster are similar to each other while being dissimilar to objects from other clusters. One challenge to this task arises when dealing with datasets where the objects are characterized by an increased number of features. Objects within a cluster may exhibit correlations among a subset of features. In order to detect such clusters, within the past two decades significant contributions have been made which yielded a wealth of literature presenting algorithms for detecting clusters in arbitrarily oriented subspaces. Each of them approaches the correlation clustering task differently, by relying on different underlying models and techniques. Building on the current progress made, this work addresses the following aspects: First, it is dedicated to the research question of how to actually measure and therefore evaluate the quality of a correlation clustering. As an initial endeavor, it is investigated how far objectives for internal evaluation criteria can be derived from existing correlation clustering algorithms. The results from this approach, however, exhibited limitations rendering the derived internal evaluation measures not suitable. As a consequence endeavors have been made to identify commonalities among correlation clustering algorithms leading to a cost function that is introduced as an internal evaluation measure. Experiments illustrate its capability to assess clusterings based on aspects that are inherent to all correlation clustering algorithms studied so far. Second, among the existing correlation clustering algorithms, one takes a unique approach. Clusters are detected in a space spanned by the parameters of a given function, known as Hough space. The detection itself is achieved by finding so-called regions of interest (ROI) in Hough space. 
While the detection of ROIs in the existing algorithm performs well in most cases, there are conditions under which the runtime deteriorates, especially in data sets with high amounts of noise. In this work, two different novel strategies are proposed for ROI detection in Hough space, and their individual strengths and weaknesses are elaborated. Besides the aspect of ROI detection, endeavors are made to go beyond linearity by proposing approaches for detecting quadratic and periodic correlated clusters using the Hough transform. Third, while there exist different views, like local and global correlated clusters, this work explores how far both views can be unified under a single concept. Finally, approaches are proposed and investigated that enhance the resilience of correlation clustering methods against outliers.
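    To make the Hough-space idea concrete for the linear case: each data point (x, y) votes for all parameter cells (slope, intercept) of lines passing near it, and a region of interest is a cell of parameter space that collects many votes. A toy accumulator sketch (illustrative only; this is not the thesis's ROI-detection strategies):

```python
import numpy as np

def hough_line_votes(points, slopes, intercepts, tol=0.05):
    """Accumulate votes in (slope, intercept) parameter space.

    A point (x, y) votes for cell (m, b) when y is within tol of m*x + b,
    i.e. the point lies close to that candidate line.
    """
    acc = np.zeros((len(slopes), len(intercepts)), dtype=int)
    for x, y in points:
        for i, m in enumerate(slopes):
            for j, b in enumerate(intercepts):
                if abs(y - (m * x + b)) < tol:
                    acc[i, j] += 1
    return acc

# Points on the line y = 2x + 1, plus one outlier.
pts = [(x, 2 * x + 1) for x in np.linspace(0, 1, 20)] + [(0.5, 5.0)]
slopes = np.linspace(0, 3, 31)      # candidate slopes
intercepts = np.linspace(0, 2, 21)  # candidate intercepts
acc = hough_line_votes(pts, slopes, intercepts)
i, j = np.unravel_index(acc.argmax(), acc.shape)
print(slopes[i], intercepts[j])  # the densest cell sits near (2, 1)
```

    Detecting quadratic or periodic correlated clusters follows the same pattern with a different parameterized function generating the votes.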

    MCA: Multiresolution Correlation Analysis, a graphical tool for subpopulation identification in single-cell gene expression data

    Background: Biological data often originate from samples containing mixtures of subpopulations, corresponding e.g. to distinct cellular phenotypes. However, identification of distinct subpopulations may be difficult if biological measurements yield distributions that are not easily separable. Results: We present Multiresolution Correlation Analysis (MCA), a method for visually identifying subpopulations based on the local pairwise correlation between covariates, without needing to define an a priori interaction scale. We demonstrate that MCA facilitates the identification of differentially regulated subpopulations in simulated data from a small gene regulatory network, followed by application to previously published single-cell qPCR data from mouse embryonic stem cells. We show that MCA recovers previously identified subpopulations, provides additional insight into the underlying correlation structure, reveals potentially spurious compartmentalizations, and provides insight into novel subpopulations. Conclusions: MCA is a useful method for the identification of subpopulations in low-dimensional expression data, as emerging from qPCR or FACS measurements. With MCA it is possible to investigate the robustness of covariate correlations with respect to subpopulations, graphically identify outliers, and identify factors contributing to differential regulation between pairs of covariates. MCA thus provides a framework for investigation of expression correlations for genes of interest and biological hypothesis generation. Comment: BioVis 2014 conference
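    The notion of "local pairwise correlation" can be illustrated with a sliding-window sketch: correlation computed inside windows along one covariate's ordering varies with position, revealing subpopulations that a single global coefficient averages away. This toy version (not MCA's actual multiresolution definition) makes the point:

```python
import numpy as np

def local_correlations(x, y, window):
    """Pearson correlation of two covariates inside sliding windows,
    taken along the sort order of x, giving a position-dependent
    (scale-fixed) view of their relationship.
    """
    order = np.argsort(x)
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    corrs = []
    for start in range(len(xs) - window + 1):
        xw = xs[start:start + window]
        yw = ys[start:start + window]
        corrs.append(np.corrcoef(xw, yw)[0, 1])
    return np.array(corrs)

# Two subpopulations with opposite regulation: y tracks x in the first
# half and anti-tracks it in the second half.
x = np.arange(20.0)
y = np.concatenate([np.arange(10.0), -np.arange(10.0, 20.0)])
c = local_correlations(x, y, window=5)
```

    MCA additionally varies the window scale itself, so no single interaction scale has to be fixed in advance.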

    Sublinear Time and Space Algorithms for Correlation Clustering via Sparse-Dense Decompositions

    We present a new approach for solving (minimum disagreement) correlation clustering that results in sublinear algorithms with highly efficient time and space complexity for this problem. In particular, we obtain the following algorithms for n-vertex (+/-)-labeled graphs G: -- A sublinear-time algorithm that with high probability returns a constant approximation clustering of G in O(n log^2 n) time assuming access to the adjacency list of the (+)-labeled edges of G (this is almost quadratically faster than even reading the input once). Previously, no sublinear-time algorithm was known for this problem with any multiplicative approximation guarantee. -- A semi-streaming algorithm that with high probability returns a constant approximation clustering of G in O(n log n) space and a single pass over the edges of the graph G (this memory is almost quadratically smaller than the input size). Previously, no single-pass algorithm with o(n^2) space was known for this problem with any approximation guarantee. The main ingredient of our approach is a novel connection to sparse-dense graph decompositions that are used extensively in the graph coloring literature. To our knowledge, this connection is the first application of these decompositions beyond graph coloring, and in particular for the correlation clustering problem, and can be of independent interest.
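    A quantity at the heart of sparse-dense decompositions is the similarity of two vertices' neighbourhoods: vertices of a dense "almost-clique" have nearly identical (+)-neighbourhoods. A sketch of that exact similarity test (illustrative; the sublinear algorithm would estimate it by sampling rather than computing it exactly):

```python
def positive_jaccard(adj, u, v):
    """Jaccard similarity of closed (+)-neighbourhoods, with each vertex
    counted as its own neighbour. Sparse-dense decompositions treat an
    edge whose endpoints score high here as belonging to a dense part.
    """
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / len(nu | nv)

# Two (+)-triangles {a,b,c} and {d,e,f} joined by the single edge c-d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
print(positive_jaccard(adj, "a", "b"))  # within a triangle -> 1.0
print(positive_jaccard(adj, "c", "d"))  # across the bridge -> lower
```

    Intuitively, high-similarity regions become cluster cores, while low-similarity vertices are handled separately as the sparse part.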