4,273 research outputs found
Clustering and Community Detection with Imbalanced Clusters
Spectral clustering methods which are frequently used in clustering and
community detection applications are sensitive to the specific graph
constructions particularly when imbalanced clusters are present. We show that
ratio cut (RCut) or normalized cut (NCut) objectives are not tailored to
imbalanced cluster sizes since they tend to emphasize cut sizes over cut
values. We propose a graph partitioning problem that seeks minimum cut
partitions under minimum size constraints on partitions to deal with imbalanced
cluster sizes. Our approach parameterizes a family of graphs by adaptively
modulating node degrees on a fixed node set, yielding a set of parameter
dependent cuts reflecting varying levels of imbalance. The solution to our
problem is then obtained by optimizing over these parameters. We present
rigorous limit cut analysis results to justify our approach and demonstrate the
superiority of our method through experiments on synthetic and real datasets
for data clustering, semi-supervised learning and community detection.Comment: Extended version of arXiv:1309.2303 with new applications. Accepted
to IEEE TSIP
Unsupervised Feature Selection with Adaptive Structure Learning
The problem of feature selection has raised considerable interests in the
past decade. Traditional unsupervised methods select the features which can
faithfully preserve the intrinsic structures of data, where the intrinsic
structures are estimated using all the input features of data. However, the
estimated intrinsic structures are unreliable/inaccurate when the redundant and
noisy features are not removed. Therefore, we face a dilemma here: one need the
true structures of data to identify the informative features, and one need the
informative features to accurately estimate the true structures of data. To
address this, we propose a unified learning framework which performs structure
learning and feature selection simultaneously. The structures are adaptively
learned from the results of feature selection, and the informative features are
reselected to preserve the refined structures of data. By leveraging the
interactions between these two essential tasks, we are able to capture accurate
structures and select more informative features. Experimental results on many
benchmark data sets demonstrate that the proposed method outperforms many state
of the art unsupervised feature selection methods
Compressive Embedding and Visualization using Graphs
Visualizing high-dimensional data has been a focus in data analysis
communities for decades, which has led to the design of many algorithms, some
of which are now considered references (such as t-SNE for example). In our era
of overwhelming data volumes, the scalability of such methods have become more
and more important. In this work, we present a method which allows to apply any
visualization or embedding algorithm on very large datasets by considering only
a fraction of the data as input and then extending the information to all data
points using a graph encoding its global similarity. We show that in most
cases, using only samples is sufficient to diffuse the
information to all data points. In addition, we propose quantitative
methods to measure the quality of embeddings and demonstrate the validity of
our technique on both synthetic and real-world datasets
Unifying Sparsest Cut, Cluster Deletion, and Modularity Clustering Objectives with Correlation Clustering
Graph clustering, or community detection, is the task of identifying groups
of closely related objects in a large network. In this paper we introduce a new
community-detection framework called LambdaCC that is based on a specially
weighted version of correlation clustering. A key component in our methodology
is a clustering resolution parameter, , which implicitly controls the
size and structure of clusters formed by our framework. We show that, by
increasing this parameter, our objective effectively interpolates between two
different strategies in graph clustering: finding a sparse cut and forming
dense subgraphs. Our methodology unifies and generalizes a number of other
important clustering quality functions including modularity, sparsest cut, and
cluster deletion, and places them all within the context of an optimization
problem that has been well studied from the perspective of approximation
algorithms. Our approach is particularly relevant in the regime of finding
dense clusters, as it leads to a 2-approximation for the cluster deletion
problem. We use our approach to cluster several graphs, including large
collaboration networks and social networks
Location Dependent Dirichlet Processes
Dirichlet processes (DP) are widely applied in Bayesian nonparametric
modeling. However, in their basic form they do not directly integrate
dependency information among data arising from space and time. In this paper,
we propose location dependent Dirichlet processes (LDDP) which incorporate
nonparametric Gaussian processes in the DP modeling framework to model such
dependencies. We develop the LDDP in the context of mixture modeling, and
develop a mean field variational inference algorithm for this mixture model.
The effectiveness of the proposed modeling framework is shown on an image
segmentation task
Clustering and Community Detection in Directed Networks: A Survey
Networks (or graphs) appear as dominant structures in diverse domains,
including sociology, biology, neuroscience and computer science. In most of the
aforementioned cases graphs are directed - in the sense that there is
directionality on the edges, making the semantics of the edges non symmetric.
An interesting feature that real networks present is the clustering or
community structure property, under which the graph topology is organized into
modules commonly called communities or clusters. The essence here is that nodes
of the same community are highly similar while on the contrary, nodes across
communities present low similarity. Revealing the underlying community
structure of directed complex networks has become a crucial and
interdisciplinary topic with a plethora of applications. Therefore, naturally
there is a recent wealth of research production in the area of mining directed
graphs - with clustering being the primary method and tool for community
detection and evaluation. The goal of this paper is to offer an in-depth review
of the methods presented so far for clustering directed networks along with the
relevant necessary methodological background and also related applications. The
survey commences by offering a concise review of the fundamental concepts and
methodological base on which graph clustering algorithms capitalize on. Then we
present the relevant work along two orthogonal classifications. The first one
is mostly concerned with the methodological principles of the clustering
algorithms, while the second one approaches the methods from the viewpoint
regarding the properties of a good cluster in a directed network. Further, we
present methods and metrics for evaluating graph clustering results,
demonstrate interesting application domains and provide promising future
research directions.Comment: 86 pages, 17 figures. Physics Reports Journal (To Appear
- âŠ