21 research outputs found
Correlation Clustering with Adaptive Similarity Queries
In correlation clustering, we are given n objects together with a binary
similarity score between each pair of them. The goal is to partition the
objects into clusters so as to minimise the disagreements with the scores. In this
work we investigate correlation clustering as an active learning problem: each
similarity score can be learned by making a query, and the goal is to minimise
both the disagreements and the total number of queries. On the one hand, we
describe simple active learning algorithms, which provably achieve an almost
optimal trade-off while giving cluster recovery guarantees, and we test them on
different datasets. On the other hand, we prove information-theoretical bounds
on the number of queries necessary to guarantee a prescribed disagreement
bound. These results give a rich characterization of the trade-off between
queries and clustering error.
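The objective can be stated in a few lines of code. The sketch below (hypothetical function name, not the paper's algorithm) counts the disagreements of a candidate clustering against a binary similarity matrix:

```python
import itertools

def disagreements(sim, clusters):
    """Count pairs whose binary similarity disagrees with the clustering.

    sim[i][j] == 1 means objects i and j are labeled similar; clusters maps
    each object to a cluster id. A disagreement is a similar pair split
    across clusters, or a dissimilar pair placed in the same cluster.
    """
    n = len(sim)
    errs = 0
    for i, j in itertools.combinations(range(n), 2):
        same = clusters[i] == clusters[j]
        if sim[i][j] == 1 and not same:
            errs += 1
        elif sim[i][j] == 0 and same:
            errs += 1
    return errs

sim = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
print(disagreements(sim, {0: 0, 1: 0, 2: 1}))  # 0: clustering matches all scores
print(disagreements(sim, {0: 0, 1: 1, 2: 1}))  # 2: pairs (0,1) and (1,2) disagree
```

In the active setting studied here, each `sim[i][j]` lookup would be a paid query, so an algorithm tries to cluster well while touching as few entries as possible.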
One Partition Approximating All ℓ_p-norm Objectives in Correlation Clustering
This paper considers correlation clustering on unweighted complete graphs. We
give a combinatorial algorithm that returns a single clustering solution that
is simultaneously O(1)-approximate for all ℓ_p-norms of the disagreement
vector. This proves that minimal sacrifice is needed in order to optimize
different norms of the disagreement vector. Our algorithm is the first
combinatorial approximation algorithm for the ℓ_2-norm objective, and more
generally the first combinatorial algorithm for the ℓ_p-norm objective
when 1 < p < ∞. It is also faster than all previous algorithms that
minimize the ℓ_p-norm of the disagreement vector, with run-time
O(n^ω), where O(n^ω) is the time for matrix multiplication on n × n matrices. When the maximum positive degree in the graph is at most
Δ, this run-time can be improved further.
Comment: 27 pages, 2 figures
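To fix notation, the disagreement vector records, per vertex, the number of incident pairs whose label disagrees with the clustering; the ℓ_p-norms of this vector are the objectives being approximated simultaneously. A minimal sketch (hypothetical names, not the paper's algorithm):

```python
import itertools

def disagreement_vector(pos_edges, clusters, n):
    """Per-vertex disagreement counts for a clustering of an unweighted
    complete graph whose '+' edges are listed; all other pairs are '-'."""
    vec = [0] * n
    pos = set(map(frozenset, pos_edges))
    for i, j in itertools.combinations(range(n), 2):
        same = clusters[i] == clusters[j]
        positive = frozenset((i, j)) in pos
        if positive != same:  # '+' across clusters, or '-' inside one
            vec[i] += 1
            vec[j] += 1
    return vec

def lp_norm(vec, p):
    """l_p norm of the disagreement vector; p = inf gives the max."""
    if p == float("inf"):
        return max(vec)
    return sum(x ** p for x in vec) ** (1 / p)

vec = disagreement_vector([(0, 1), (1, 2)], {0: 0, 1: 0, 2: 1}, 3)
# The '+' edge (1, 2) crosses the clusters, so vertices 1 and 2 each
# pick up one disagreement: vec == [0, 1, 1].
```

The ℓ_1 case recovers (twice) the total-disagreement objective, while ℓ_∞ penalizes the worst-off vertex; the paper's point is that one clustering can be good for all of these at once.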
Single-Pass Pivot Algorithm for Correlation Clustering. Keep it simple!
We show that a simple single-pass semi-streaming variant of the Pivot
algorithm for Correlation Clustering gives a (3 + ε)-approximation
using O(n/ε) words of memory. This is a slight improvement over the
recent results of Cambus, Kuhn, Lindy, Pai, and Uitto, who gave a (3 +
ε)-approximation using O(n log n) words of memory, and Behnezhad,
Charikar, Ma, and Tan, who gave a 5-approximation using O(n) words of memory.
One of the main contributions of this paper is that both the algorithm and its
analysis are very simple, and the algorithm is easy to implement.
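For reference, the classic offline Pivot algorithm of Ailon, Charikar, and Newman, on which such streaming variants build, fits in a few lines. The sketch below is the standard in-memory version (an expected 3-approximation), not the paper's single-pass implementation:

```python
import random

def pivot(n, pos_edges, seed=0):
    """Ailon-Charikar-Newman Pivot: repeatedly pick a random unclustered
    vertex as pivot and cluster it with all its unclustered '+' neighbors.
    Offline sketch; the streaming variant processes edges in one pass."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in pos_edges:
        adj[u].add(v)
        adj[v].add(u)
    order = list(range(n))
    rng.shuffle(order)  # a uniformly random pivot order
    cluster = {}
    for p in order:
        if p in cluster:
            continue
        cluster[p] = p  # p becomes the pivot of a new cluster
        for w in adj[p]:
            if w not in cluster:
                cluster[w] = p
    return cluster
```

The streaming question is then how to run this while storing far less than the full '+' adjacency structure; the paper shows O(n/ε) words suffice for a (3 + ε)-approximation.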
Efficient Correlation Clustering Methods for Large Consensus Clustering Instances
Consensus clustering (or clustering aggregation) inputs k partitions of a
given ground set V, and seeks to create a single partition that minimizes
disagreement with all input partitions. State-of-the-art algorithms for
consensus clustering are based on correlation clustering methods like the
popular Pivot algorithm. Unfortunately these methods have not proved to be
practical for consensus clustering instances where either n = |V| or k gets
large.
In this paper we provide practical run-time improvements for correlation
clustering solvers when n is large. We reduce the time complexity of Pivot
from O(n²k) to O(nk), and its space complexity from O(n²) to
O(nk) -- a significant savings since in practice k is much less than
n. We also analyze a sampling method for these algorithms when k is
large, bridging the gap between running Pivot on the full set of input
partitions (an expected 1.57-approximation) and choosing a single input
partition at random (an expected 2-approximation). We show experimentally that
algorithms like Pivot do obtain quality clustering results in practice even on
small samples of input partitions.
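The reduction from consensus clustering to correlation clustering can be sketched in a few lines. Below, a hypothetical `consensus_pivot` adds a '+' edge for each pair that a majority of input partitions places together, then runs plain Pivot; the paper's time- and space-improved Pivot is not reproduced here:

```python
import itertools
import random

def consensus_pivot(partitions, seed=0):
    """Consensus clustering via correlation clustering: majority vote over
    the k input partitions builds a '+'/'-' instance, which Pivot clusters.
    Sketch of the standard reduction only."""
    n = len(partitions[0])
    k = len(partitions)
    adj = {v: set() for v in range(n)}
    for u, v in itertools.combinations(range(n), 2):
        together = sum(1 for p in partitions if p[u] == p[v])
        if 2 * together > k:  # strict majority puts u and v together
            adj[u].add(v)
            adj[v].add(u)
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    cluster = {}
    for piv in order:
        if piv in cluster:
            continue
        cluster[piv] = piv
        for w in adj[piv]:
            if w not in cluster:
                cluster[w] = piv
    return cluster

# Three input partitions of four objects; the majority graph has the
# '+' edges 0-1 and 2-3, so Pivot recovers the clusters {0,1} and {2,3}.
parts = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
```

The sampling question studied in the paper corresponds to computing `together` over a random subset of the k partitions instead of all of them.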
Advances in correlation clustering
The task of clustering is to partition a given dataset in such a way that objects within a cluster are similar to each other while being dissimilar to objects from other clusters. One challenge for this task arises when dealing with datasets where the objects are characterized by a large number of features. Objects within a cluster may exhibit correlations among a subset of features. In order to detect such clusters, significant contributions have been made within the past two decades, yielding a wealth of literature presenting algorithms for detecting clusters in arbitrarily oriented subspaces. Each of them approaches the correlation clustering task differently, relying on different underlying models and techniques. Building on the current progress made, this work addresses the following aspects.
First, it is dedicated to the research question of how to actually measure, and therefore evaluate, the quality of a correlation clustering. As an initial endeavor, it is investigated how far objectives for internal evaluation criteria can be derived from existing correlation clustering algorithms. The results from this approach, however, exhibited limitations rendering the derived internal evaluation measures unsuitable. As a consequence, endeavors were made to identify commonalities among correlation clustering algorithms, leading to a cost function that is introduced as an internal evaluation measure. Experiments illustrate its capability to assess clusterings based on aspects that are inherent to all correlation clustering algorithms studied so far.
Second, among the existing correlation clustering algorithms, one takes a unique approach: clusters are detected in a space spanned by the parameters of a given function, known as Hough space. The detection itself is achieved by finding so-called regions of interest (ROIs) in Hough space.
While the detection of ROIs in the existing algorithm performs well in most cases, there are conditions under which the runtime deteriorates, especially in data sets with high amounts of noise. In this work, two novel strategies for ROI detection in Hough space are proposed, and their individual strengths and weaknesses are elaborated. Beyond ROI detection, endeavors are made to go beyond linearity by proposing approaches for detecting quadratically and periodically correlated clusters using the Hough transform. Third, while there exist different views, such as local and global correlated clusters, this work explores how far both views can be unified under a single concept. Finally, approaches are proposed and investigated that enhance the resilience of correlation clustering methods against outliers.
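The Hough-space view can be illustrated with the classic line-detection transform: each point votes for every discretized line passing through it, and points sharing a linear correlation concentrate their votes in one parameter cell (the kind of region of interest the thesis detects). A toy sketch, not the thesis's ROI algorithm:

```python
import math
from collections import Counter

def hough_votes(points, n_theta=180, rho_step=0.5):
    """Vote in a discretized (theta, rho) parameter space: every point
    (x, y) lies on the line x*cos(theta) + y*sin(theta) = rho for each
    theta, so collinear points pile votes into a common cell."""
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho / rho_step))] += 1
    return votes

# Five collinear points on the line y = x; their votes coincide in the
# cell for theta = 3*pi/4, rho = 0.
pts = [(i, i) for i in range(5)]
votes = hough_votes(pts)
```

A peak in this accumulator is a region of interest; the thesis's contribution concerns finding such ROIs efficiently, and extending the idea beyond linear correlations.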
MCA: Multiresolution Correlation Analysis, a graphical tool for subpopulation identification in single-cell gene expression data
Background: Biological data often originate from samples containing mixtures
of subpopulations, corresponding e.g. to distinct cellular phenotypes. However,
identification of distinct subpopulations may be difficult if biological
measurements yield distributions that are not easily separable. Results: We
present Multiresolution Correlation Analysis (MCA), a method for visually
identifying subpopulations based on the local pairwise correlation between
covariates, without needing to define an a priori interaction scale. We
demonstrate that MCA facilitates the identification of differentially regulated
subpopulations in simulated data from a small gene regulatory network, followed
by application to previously published single-cell qPCR data from mouse
embryonic stem cells. We show that MCA recovers previously identified
subpopulations, provides additional insight into the underlying correlation
structure, reveals potentially spurious compartmentalizations, and provides
insight into novel subpopulations. Conclusions: MCA is a useful method for the
identification of subpopulations in low-dimensional expression data, as
emerging from qPCR or FACS measurements. With MCA it is possible to investigate
the robustness of covariate correlations with respect to subpopulations,
graphically identify outliers, and identify factors contributing to
differential regulation between pairs of covariates. MCA thus provides a
framework for investigation of expression correlations for genes of interest
and biological hypothesis generation.
Comment: BioVis 2014 conference
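To make "local pairwise correlation" concrete, here is a minimal sliding-window sketch (hypothetical names, not the published MCA implementation): Pearson correlation is computed inside windows of one size, and scanning several window sizes gives the multiresolution picture:

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def local_correlations(x, y, window):
    """Correlation of two covariates inside sliding windows of one size
    (one 'resolution'); scanning several window sizes gives a
    multiresolution view in the spirit of MCA."""
    pairs = sorted(zip(x, y))  # order samples along the first covariate
    return [pearson([p[0] for p in pairs[i:i + window]],
                    [p[1] for p in pairs[i:i + window]])
            for i in range(len(pairs) - window + 1)]

# Two subpopulations with opposite local trends: globally mixed, but the
# windowed correlations flip sign from +1 to -1 along the x-axis.
cs = local_correlations(list(range(6)), [0, 1, 2, 10, 9, 8], 3)
```

A covariate pair whose local correlation changes sign across the sample, as in this toy example, is exactly the signature of distinct subpopulations that a single global correlation coefficient would hide.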
Sublinear Time and Space Algorithms for Correlation Clustering via Sparse-Dense Decompositions
We present a new approach for solving (minimum disagreement) correlation
clustering that results in sublinear algorithms with highly efficient time and
space complexity for this problem. In particular, we obtain the following
algorithms for n-vertex (+/-)-labeled graphs G:
-- A sublinear-time algorithm that with high probability returns a constant
approximation clustering of G in Õ(n) time assuming access to the
adjacency list of the (+)-labeled edges of G (this is almost quadratically
faster than even reading the input once). Previously, no sublinear-time
algorithm was known for this problem with any multiplicative approximation
guarantee.
-- A semi-streaming algorithm that with high probability returns a constant
approximation clustering of G in Õ(n) space and a single pass over
the edges of the graph (this memory is almost quadratically smaller than the
input size). Previously, no single-pass algorithm with o(n²) space was known
for this problem with any approximation guarantee.
The main ingredient of our approach is a novel connection to sparse-dense
graph decompositions that are used extensively in the graph coloring
literature. To our knowledge, this connection is the first application of these
decompositions beyond graph coloring, and in particular for the correlation
clustering problem, and can be of independent interest.
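Sparse-dense decompositions classify vertices by how much their '+'-neighborhoods agree. As a toy illustration of the underlying notion (not the paper's sublinear implementation, which only samples the adjacency lists), one can test whether two vertices' closed '+'-neighborhoods nearly coincide:

```python
def in_agreement(adj, u, v, eps):
    """u and v are in eps-agreement when their closed '+'-neighborhoods
    differ on few vertices relative to their size -- the kind of test that
    underlies sparse-dense (almost-clique) decompositions."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu ^ nv) <= eps * max(len(nu), len(nv))

# In a '+' triangle {0, 1, 2}, vertices 0 and 1 share their whole closed
# neighborhood; the isolated vertex 3 agrees with nobody.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
```

Vertices whose neighborhoods pairwise agree form the dense, almost-clique part of the decomposition (natural cluster candidates), while the remaining sparse vertices can be handled separately; the paper's contribution is carrying this out with sublinear time and space.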