21 research outputs found

    Correlation Clustering with Adaptive Similarity Queries

    In correlation clustering, we are given n objects together with a binary similarity score between each pair of them. The goal is to partition the objects into clusters so as to minimise the disagreements with the scores. In this work we investigate correlation clustering as an active learning problem: each similarity score can be learned by making a query, and the goal is to minimise both the disagreements and the total number of queries. On the one hand, we describe simple active learning algorithms, which provably achieve an almost optimal trade-off while giving cluster recovery guarantees, and we test them on different datasets. On the other hand, we prove information-theoretic bounds on the number of queries necessary to guarantee a prescribed disagreement bound. These results give a rich characterization of the trade-off between queries and clustering error.
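    To make the objective above concrete, here is a minimal sketch (function and variable names are illustrative, not from the paper) that counts the disagreements of a candidate partition against the given binary pairwise scores:

```python
from itertools import combinations

def disagreements(labels, similar):
    """Count correlation-clustering disagreements.

    labels:  dict mapping object -> cluster id
    similar: dict mapping frozenset({u, v}) -> 1 (similar) or 0 (dissimilar)
    """
    cost = 0
    for u, v in combinations(labels, 2):
        same_cluster = labels[u] == labels[v]
        is_similar = similar[frozenset((u, v))] == 1
        # A disagreement: a similar pair split apart, or a dissimilar pair merged.
        if same_cluster != is_similar:
            cost += 1
    return cost

# Three objects; a and b are similar, c is dissimilar to both.
sim = {frozenset(p): s for p, s in [(("a", "b"), 1), (("a", "c"), 0), (("b", "c"), 0)]}
print(disagreements({"a": 0, "b": 0, "c": 1}, sim))  # perfect clustering -> 0
print(disagreements({"a": 0, "b": 1, "c": 1}, sim))  # splits a,b and merges b,c -> 2
```

    In the active-learning setting studied by the paper, each lookup of `similar` would instead be a paid query, and the algorithm chooses which pairs to query.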

    One Partition Approximating All ℓ_p-norm Objectives in Correlation Clustering

    This paper considers correlation clustering on unweighted complete graphs. We give a combinatorial algorithm that returns a single clustering solution that is simultaneously O(1)-approximate for all ℓ_p-norms of the disagreement vector. This proves that minimal sacrifice is needed in order to optimize different norms of the disagreement vector. Our algorithm is the first combinatorial approximation algorithm for the ℓ_2-norm objective, and more generally the first combinatorial algorithm for the ℓ_p-norm objective when 2 ≤ p < ∞. It is also faster than all previous algorithms that minimize the ℓ_p-norm of the disagreement vector, with run-time O(n^ω), where O(n^ω) is the time for matrix multiplication on n × n matrices. When the maximum positive degree in the graph is at most Δ, this can be improved to a run-time of O(n Δ^2 log n). Comment: 27 pages, 2 figures
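    The disagreement vector assigns to each vertex the number of edges it disagrees on; the ℓ_1-norm is the total error, while the ℓ_∞-norm is the worst single vertex's error. A small sketch of these quantities (a toy computation, not the paper's algorithm):

```python
from itertools import combinations

def disagreement_vector(labels, positive_edges):
    """Per-vertex disagreements on a complete (+/-)-labeled graph.

    labels:         dict mapping vertex -> cluster id
    positive_edges: set of frozenset({u, v}) carrying a (+) label;
                    every other pair is implicitly (-).
    """
    vec = {v: 0 for v in labels}
    for u, v in combinations(labels, 2):
        same = labels[u] == labels[v]
        pos = frozenset((u, v)) in positive_edges
        if same != pos:      # a mistake is charged to both endpoints
            vec[u] += 1
            vec[v] += 1
    return vec

def lp_norm(vec, p):
    vals = list(vec.values())
    if p == float("inf"):
        return max(vals)
    return sum(x ** p for x in vals) ** (1 / p)

# Cluster {a,b,c} together: the missing (+) edge a-c costs a and c one each.
vec = disagreement_vector({"a": 0, "b": 0, "c": 0, "d": 1},
                          {frozenset(("a", "b")), frozenset(("b", "c"))})
```

    The paper's result says one clustering can be close to optimal for every such norm at once, so these objectives need not be traded off against each other.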

    Single-Pass Pivot Algorithm for Correlation Clustering. Keep it simple!

    We show that a simple single-pass semi-streaming variant of the Pivot algorithm for Correlation Clustering gives a (3 + Δ)-approximation using O(n/Δ) words of memory. This is a slight improvement over the recent results of Cambus, Kuhn, Lindy, Pai, and Uitto, who gave a (3 + Δ)-approximation using O(n log n) words of memory, and Behnezhad, Charikar, Ma, and Tan, who gave a 5-approximation using O(n) words of memory. One of the main contributions of this paper is that both the algorithm and its analysis are very simple, and the algorithm is easy to implement.
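    The classical offline Pivot algorithm referenced above is simple enough to state in a few lines. A sketch of that baseline version (a 3-approximation in expectation; this is not the paper's semi-streaming variant):

```python
import random

def pivot(vertices, positive_edges, seed=None):
    """Classical Pivot for correlation clustering (Ailon-Charikar-Newman style).

    Repeatedly pick a random unclustered pivot and cluster it together with
    all of its still-unclustered (+) neighbours.
    """
    rng = random.Random(seed)
    unclustered = set(vertices)
    clusters = []
    while unclustered:
        piv = rng.choice(sorted(unclustered))  # sorted only for reproducibility
        cluster = {piv} | {u for u in unclustered
                           if frozenset((piv, u)) in positive_edges}
        clusters.append(cluster)
        unclustered -= cluster
    return clusters

clusters = pivot(["a", "b", "c"], {frozenset(("a", "b"))}, seed=1)
```

    The streaming challenge the paper addresses is running this idea in a single pass over the edges with only O(n/Δ) words of memory, rather than with random access to the whole graph.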

    Efficient Correlation Clustering Methods for Large Consensus Clustering Instances

    Consensus clustering (or clustering aggregation) inputs k partitions of a given ground set V, and seeks to create a single partition that minimizes disagreement with all input partitions. State-of-the-art algorithms for consensus clustering are based on correlation clustering methods like the popular Pivot algorithm. Unfortunately these methods have not proved to be practical for consensus clustering instances where either k or |V| gets large. In this paper we provide practical run time improvements for correlation clustering solvers when |V| is large. We reduce the time complexity of Pivot from O(|V|^2 k) to O(|V| k), and its space complexity from O(|V|^2) to O(|V| k) -- a significant savings since in practice k is much less than |V|. We also analyze a sampling method for these algorithms when k is large, bridging the gap between running Pivot on the full set of input partitions (an expected 1.57-approximation) and choosing a single input partition at random (an expected 2-approximation). We show experimentally that algorithms like Pivot do obtain quality clustering results in practice even on small samples of input partitions.
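    The key to the O(|V| k) bound is that a pair's similarity can be checked lazily from the k input labels in O(k) time, so Pivot never needs to materialize the O(|V|^2) correlation graph. A sketch of that idea (the majority-vote similarity rule here is an illustrative assumption, not necessarily the paper's exact construction):

```python
import random

def consensus_pivot(partitions, seed=None):
    """Pivot for consensus clustering without building the |V|^2 graph.

    partitions: list of k dicts, each mapping element -> block label.
    A pair counts as similar (+) when more than half of the input
    partitions put the two elements in the same block; this test is
    evaluated lazily in O(k) per pair, so each pivot round costs O(|V| k).
    """
    rng = random.Random(seed)
    elements = set(partitions[0])
    half = len(partitions) / 2

    def similar(u, v):
        return sum(1 for p in partitions if p[u] == p[v]) > half

    clusters = []
    while elements:
        piv = rng.choice(sorted(elements))
        cluster = {piv} | {u for u in elements if u != piv and similar(piv, u)}
        clusters.append(cluster)
        elements -= cluster
    return clusters

# Three identical input partitions: {1,2} together, {3,4} together.
parts = [{1: "a", 2: "a", 3: "b", 4: "b"} for _ in range(3)]
result = consensus_pivot(parts, seed=0)
```

    Sampling, as analyzed in the paper, would additionally evaluate `similar` on a random subset of the k partitions rather than all of them.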

    Advances in correlation clustering

    The task of clustering is to partition a given dataset in such a way that objects within a cluster are similar to each other while being dissimilar to objects from other clusters. One challenge to this task arises when dealing with datasets where the objects are characterized by an increased number of features. Objects within a cluster may exhibit correlations among a subset of features. In order to detect such clusters, within the past two decades significant contributions have been made which yielded a wealth of literature presenting algorithms for detecting clusters in arbitrarily oriented subspaces. Each of them approaches the correlation clustering task differently, by relying on different underlying models and techniques. Building on the current progress made, this work addresses the following aspects: First, it is dedicated to the research question of how to actually measure and therefore evaluate the quality of a correlation clustering. As an initial endeavor, it is investigated how far objectives for internal evaluation criteria can be derived from existing correlation clustering algorithms. The results from this approach, however, exhibited limitations rendering the derived internal evaluation measures not suitable. As a consequence endeavors have been made to identify commonalities among correlation clustering algorithms leading to a cost function that is introduced as an internal evaluation measure. Experiments illustrate its capability to assess clusterings based on aspects that are inherent to all correlation clustering algorithms studied so far. Second, among the existing correlation clustering algorithms, one takes a unique approach. Clusters are detected in a space spanned by the parameters of a given function, known as Hough space. The detection itself is achieved by finding so-called regions of interest (ROI) in Hough space. 
While the detection of ROIs in the existing algorithm performs well in most cases, there are conditions under which the runtime deteriorates, especially in data sets with high amounts of noise. In this work, two different novel strategies are proposed for ROI detection in Hough space, and their individual strengths and weaknesses are elaborated. Besides the aspect of ROI detection, endeavors are made to go beyond linearity by proposing approaches for detecting quadratic and periodic correlated clusters using the Hough transform. Third, while there exist different views, like local and global correlated clusters, this work explores how far both views can be unified under a single concept. Finally, approaches are proposed and investigated that enhance the resilience of correlation clustering methods against outliers.
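    To make the Hough-space idea concrete for the linear case: each data point (x, y) votes for all parameter cells (slope, intercept) of lines passing near it, and a region of interest is a cell of parameter space that collects many votes. A toy accumulator sketch (illustrative only; this is not the thesis's ROI-detection strategies):

```python
import numpy as np

def hough_line_votes(points, slopes, intercepts, tol=0.05):
    """Accumulate votes in (slope, intercept) parameter space.

    A point (x, y) votes for cell (m, b) when y is within tol of m*x + b,
    i.e. the point lies close to that candidate line.
    """
    acc = np.zeros((len(slopes), len(intercepts)), dtype=int)
    for x, y in points:
        for i, m in enumerate(slopes):
            for j, b in enumerate(intercepts):
                if abs(y - (m * x + b)) < tol:
                    acc[i, j] += 1
    return acc

# Points on the line y = 2x + 1, plus one outlier.
pts = [(x, 2 * x + 1) for x in np.linspace(0, 1, 20)] + [(0.5, 5.0)]
slopes = np.linspace(0, 3, 31)      # candidate slopes
intercepts = np.linspace(0, 2, 21)  # candidate intercepts
acc = hough_line_votes(pts, slopes, intercepts)
i, j = np.unravel_index(acc.argmax(), acc.shape)
print(slopes[i], intercepts[j])  # the densest cell sits near (2, 1)
```

    Detecting quadratic or periodic correlated clusters follows the same pattern with a different parameterized function generating the votes.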

    MCA: Multiresolution Correlation Analysis, a graphical tool for subpopulation identification in single-cell gene expression data

    Background: Biological data often originate from samples containing mixtures of subpopulations, corresponding e.g. to distinct cellular phenotypes. However, identification of distinct subpopulations may be difficult if biological measurements yield distributions that are not easily separable. Results: We present Multiresolution Correlation Analysis (MCA), a method for visually identifying subpopulations based on the local pairwise correlation between covariates, without needing to define an a priori interaction scale. We demonstrate that MCA facilitates the identification of differentially regulated subpopulations in simulated data from a small gene regulatory network, followed by application to previously published single-cell qPCR data from mouse embryonic stem cells. We show that MCA recovers previously identified subpopulations, provides additional insight into the underlying correlation structure, reveals potentially spurious compartmentalizations, and provides insight into novel subpopulations. Conclusions: MCA is a useful method for the identification of subpopulations in low-dimensional expression data, as emerging from qPCR or FACS measurements. With MCA it is possible to investigate the robustness of covariate correlations with respect to subpopulations, graphically identify outliers, and identify factors contributing to differential regulation between pairs of covariates. MCA thus provides a framework for investigation of expression correlations for genes of interest and biological hypothesis generation. Comment: BioVis 2014 conference
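    The notion of "local pairwise correlation" can be illustrated with a sliding-window sketch: correlation computed inside windows along one covariate's ordering varies with position, revealing subpopulations that a single global coefficient averages away. This toy version (not MCA's actual multiresolution definition) makes the point:

```python
import numpy as np

def local_correlations(x, y, window):
    """Pearson correlation of two covariates inside sliding windows,
    taken along the sort order of x, giving a position-dependent
    (scale-fixed) view of their relationship.
    """
    order = np.argsort(x)
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    corrs = []
    for start in range(len(xs) - window + 1):
        xw = xs[start:start + window]
        yw = ys[start:start + window]
        corrs.append(np.corrcoef(xw, yw)[0, 1])
    return np.array(corrs)

# Two subpopulations with opposite regulation: y tracks x in the first
# half and anti-tracks it in the second half.
x = np.arange(20.0)
y = np.concatenate([np.arange(10.0), -np.arange(10.0, 20.0)])
c = local_correlations(x, y, window=5)
```

    MCA additionally varies the window scale itself, so no single interaction scale has to be fixed in advance.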

    Sublinear Time and Space Algorithms for Correlation Clustering via Sparse-Dense Decompositions

    We present a new approach for solving (minimum disagreement) correlation clustering that results in sublinear algorithms with highly efficient time and space complexity for this problem. In particular, we obtain the following algorithms for n-vertex (+/-)-labeled graphs G: -- A sublinear-time algorithm that with high probability returns a constant approximation clustering of G in O(n log^2 n) time assuming access to the adjacency list of the (+)-labeled edges of G (this is almost quadratically faster than even reading the input once). Previously, no sublinear-time algorithm was known for this problem with any multiplicative approximation guarantee. -- A semi-streaming algorithm that with high probability returns a constant approximation clustering of G in O(n log n) space and a single pass over the edges of the graph G (this memory is almost quadratically smaller than the input size). Previously, no single-pass algorithm with o(n^2) space was known for this problem with any approximation guarantee. The main ingredient of our approach is a novel connection to sparse-dense graph decompositions that are used extensively in the graph coloring literature. To our knowledge, this connection is the first application of these decompositions beyond graph coloring, and in particular for the correlation clustering problem, and can be of independent interest.
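    A quantity at the heart of sparse-dense decompositions is the similarity of two vertices' neighbourhoods: vertices of a dense "almost-clique" have nearly identical (+)-neighbourhoods. A sketch of that exact similarity test (illustrative; the sublinear algorithm would estimate it by sampling rather than computing it exactly):

```python
def positive_jaccard(adj, u, v):
    """Jaccard similarity of closed (+)-neighbourhoods, with each vertex
    counted as its own neighbour. Sparse-dense decompositions treat an
    edge whose endpoints score high here as belonging to a dense part.
    """
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / len(nu | nv)

# Two (+)-triangles {a,b,c} and {d,e,f} joined by the single edge c-d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
print(positive_jaccard(adj, "a", "b"))  # within a triangle -> 1.0
print(positive_jaccard(adj, "c", "d"))  # across the bridge -> lower
```

    Intuitively, high-similarity regions become cluster cores, while low-similarity vertices are handled separately as the sparse part.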